This article is about Petri net modelling of the MapReduce and Spark frameworks in bioinformatics pipelines. New processing and storage paradigms designed to manage the complexity of high-volume data have been implemented in bioinformatics pipelines; among the most popular of these programming paradigms are MapReduce and Spark. In a complex distributed environment such as the cloud, it is important to determine the correctness of data processing, and the architecture of the employed paradigms raises several design challenges. While many studies have provided formalisms for MapReduce and its implementations, little is available for these applications in the context of bioinformatics. Bioinformatics tools based on MapReduce decompose tasks into a series of Map and Reduce steps. Halvade, for example, enables sequencing pipelines to be executed in parallel on a multi-node infrastructure: it has a Map step for alignment and a Reduce step for variant calling. Myrna is another example of such a tool.
In the MapReduce programming model, tasks are broken down into Map and Reduce steps. However, iterative tasks such as clustering do not fit this model well and perform better on Apache Spark, a general-purpose computing framework. Spark achieves this through in-memory computation, and its transformations can be chained, making it well suited for iterative tasks. Spark can also run under YARN, the cluster manager integrated with the Hadoop framework. Falco, a single-cell RNA-sequencing framework, is a Spark-based tool. Bioinformatics workflows frequently produce dense data, and scalability consequently becomes an issue.
In bioinformatics workflows, the Pipe and Filter architecture is used to process big data. Almost every computational process begins with sequence similarity searches at the gene/DNA, RNA, or protein level: one or more query sequences are matched against a large reference database of genomes or proteomes. Another layer, Divide and Conquer, is applied to parallelize tasks, as shown in Figure 1.
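The combination of the two architectural layers can be sketched in plain Python. This is a toy illustration only: the "similarity search" below is a naive exact substring match, not a real alignment algorithm, and all function names are hypothetical.

```python
# A minimal sketch of Pipe-and-Filter with a Divide-and-Conquer split step.
# The exact substring "search" stands in for real sequence alignment.

def read_filter(sequences):
    # Filter 1: normalise raw reads (uppercase, strip whitespace).
    return [s.strip().upper() for s in sequences]

def split(database, n_parts):
    # Divide step: partition the reference database so each part can be
    # searched independently (as mpiBLAST does with its database segments).
    size = max(1, len(database) // n_parts)
    return [database[i:i + size] for i in range(0, len(database), size)]

def search_filter(queries, db_part):
    # Filter 2: match each query against one partition of the database.
    return [(q, ref) for q in queries for ref in db_part if q in ref]

def merge(partial_results):
    # Conquer step: merge per-partition hits into one result list.
    return [hit for part in partial_results for hit in part]

queries = read_filter([" acgt ", "ttag"])
database = ["AACGTT", "GGTTAGC", "CCCC"]
hits = merge(search_filter(queries, part) for part in split(database, 2))
print(hits)  # [('ACGT', 'AACGTT'), ('TTAG', 'GGTTAGC')]
```

Because each partition is searched independently, the `search_filter` calls could run on separate nodes; `merge` is the only synchronization point.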
Figure 1: Pipe and Filter in Divide and Conquer Architecture
Numerous studies have formalized MapReduce from various angles. Theoretical comparisons of MapReduce implementations, such as Hadoop and Spark, highlight how one implementation differs from another in a given scenario. Fluid Petri net (FPN) models have been used to estimate the execution times of MapReduce and Spark jobs at runtime. While these analyses cover applications with Map and Reduce functions, no formalisms are available for such applications in the context of bioinformatics pipelines.
What Can Be Done?
Petri nets are a general-purpose graphical modelling language for engineering concurrent systems. Built on a sound theoretical foundation, Petri nets give clear graphical representations of complex workflows, enhancing understanding of the processes through compact representations of system entities and their behavior. Moreover, Petri net models help determine the performance of MapReduce-based applications in cloud environments.
In this article, we review the literature to identify MapReduce- and Spark-based tools used in bioinformatics pipelines and then develop Petri net models for them. A Petri net is a directed graph comprising places, transitions, and directed arcs that connect places to transitions and vice versa; an arc never connects two places or two transitions directly. A transition is enabled when all of its input places satisfy the preconditions necessary to leave the previous place and move to the next.
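The token-game semantics just described can be captured in a few lines of code. This is a minimal sketch, not a full Petri net library: places hold tokens, a transition is enabled only when every input place has a token, and firing moves tokens from inputs to outputs. All names are illustrative.

```python
# Minimal Petri net: marking maps each place to its token count;
# a transition fires by consuming one token per input place and
# producing one token per output place.

class PetriNet:
    def __init__(self, marking):
        self.marking = dict(marking)   # place -> token count
        self.transitions = {}          # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (inputs, outputs)

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) > 0 for p in inputs)

    def fire(self, name):
        if not self.enabled(name):
            raise RuntimeError(f"{name} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1

net = PetriNet({"input_ready": 1, "worker_free": 1})
# Map needs an input split and a free worker; it returns the worker.
net.add_transition("Map", ["input_ready", "worker_free"],
                   ["mapped", "worker_free"])
net.fire("Map")
print(net.marking)  # {'input_ready': 0, 'worker_free': 1, 'mapped': 1}
```

With no `input_ready` token left, `Map` is no longer enabled, which is exactly the precondition behaviour the models below rely on.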
We use existing Petri net models to create two models that compare how a particular job is carried out by a MapReduce-based versus a Spark-based application. We first created a reference layout for understanding how the MapReduce- and Spark-based tools used in bioinformatics pipelines work. Bioinformatics workflows are usually parallelized by splitting tasks between CPUs, much like symmetric multiprocessing (SMP). Cluster computing is also widely used, where every node has its own hard drive and memory; building applications for such environments is challenging because they need to work with Parallel Virtual Machines (PVM) and Message Passing Interfaces (MPI). Examples of such tools in bioinformatics are mpiBLAST, which segments the search database, and MASON; both reduce the number of tasks performed at each node, thereby reducing computation time and complexity.
Figure 2: Big data Analytics in Bioinformatics: Halvade, MapReduce Pipeline for Genomic Variant Calling
In bioinformatics, variant calling is a sub-process of big data analytics by which variants are identified from sequence data. This task can be performed with the help of two tools: the MapReduce-based Halvade and the Spark-based Falco, described in Figure 2 and Figure 4, respectively. Variant calling has more than one implication, for example predicting the impact of variations on protein structure or function, or analyzing the effect of variations in evolution. Halvade first decomposes the task into a series of Map and Reduce steps: it aligns a given sequence against a reference genome, writes matching results to one file, and creates another file to record variants. Falco does not decompose into Map and Reduce; it transforms data through simpler functions (similar to, but not exactly like, MapReduce) and later combines the outputs into one final result. Its in-memory computation makes synchronizing sub-tasks easy, and its multiple deployment strategies help it integrate with Hadoop.
Figure 3: Hadoop/MapReduce Tasks
After the SetUp task, Map and Reduce follow. The Master resource manages all Worker resources and distributes information among them; it is also responsible for task allocation and performance tracking. Petri nets allow this workflow to be modelled with parallelism and synchronization in mind, and make the preconditions of each transition easy to see. To represent these processes graphically, we create places, transitions, and directed arcs for our Petri nets. Hadoop implements the MapReduce paradigm in two tasks: (i) Map and (ii) Reduce. Various subtasks within Map and Reduce also need to be modeled, as shown in Figure 3.
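The Hadoop subtasks just listed can be sketched in plain Python: Map emits key/value pairs, Shuffle groups equal keys together, Sort orders them, and Reduce aggregates. A k-mer (length-3 substring) count stands in here for a real bioinformatics job; the function names are illustrative, not Hadoop's API.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(read):
    # Map: emit (k-mer, 1) for every 3-mer in a read.
    return [(read[i:i + 3], 1) for i in range(len(read) - 2)]

def shuffle_sort(pairs):
    # Shuffle + Sort: bring equal keys together, ordered by key.
    pairs = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=itemgetter(0))]

def reduce_phase(key, values):
    # Reduce: aggregate all values observed for one key.
    return (key, sum(values))

reads = ["ACGTA", "CGTAC"]
mapped = [pair for read in reads for pair in map_phase(read)]
counts = [reduce_phase(k, vs) for k, vs in shuffle_sort(mapped)]
print(counts)  # [('ACG', 1), ('CGT', 2), ('GTA', 2), ('TAC', 1)]
```

In a real cluster the `map_phase` calls run on different Workers and the shuffle moves data between nodes; the Master only coordinates, as described above.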
Figure 4: Big data Analytics in Bioinformatics: Falco, Spark Pipeline for Single Cell Transcriptome feature Quantification
TABLE I: Preconditions for each Transition

| Transition | Precondition |
| --- | --- |
| Shuffle | Worker available; Map phase completed |
| Sort | Shuffling complete; ready for Sort |
| Reduce | Sorting complete; ready for Reduce |
| Output | Reduce phase completed |
| Clean | Output phase completed for every initialization |
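The precondition chain in Table I can be checked mechanically as a simple token game: each transition consumes the conditions it requires and produces the condition the next transition waits on. The transition names follow Table I; the condition names and `run` helper are illustrative.

```python
# Preconditions and produced conditions for each Table I transition.
PRECONDITIONS = {
    "Shuffle": {"worker_available", "map_done"},
    "Sort":    {"shuffle_done"},
    "Reduce":  {"sort_done"},
    "Output":  {"reduce_done"},
    "Clean":   {"output_done"},
}
PRODUCES = {
    "Shuffle": "shuffle_done",
    "Sort":    "sort_done",
    "Reduce":  "reduce_done",
    "Output":  "output_done",
    "Clean":   "cleaned",
}

def run(state):
    # Fire the transitions in order, refusing any whose precondition
    # set is not contained in the current state.
    fired = []
    for t in ["Shuffle", "Sort", "Reduce", "Output", "Clean"]:
        if not PRECONDITIONS[t] <= state:
            raise RuntimeError(f"{t} blocked; missing {PRECONDITIONS[t] - state}")
        state = (state - PRECONDITIONS[t]) | {PRODUCES[t]}
        fired.append(t)
    return fired, state

fired, final = run({"worker_available", "map_done"})
print(fired)  # ['Shuffle', 'Sort', 'Reduce', 'Output', 'Clean']
print(final)  # {'cleaned'}
```

Starting without `map_done` would raise immediately at `Shuffle`, mirroring how a disabled transition in the Petri net simply cannot fire.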
For modeling with Petri nets, we consider two tools used in bioinformatics: Halvade, based on the MapReduce paradigm, and Falco, based on the Apache Spark framework.
The model of the MapReduce paradigm considers N users that can submit jobs to the system after a setup time Z. Job submission by a user is modeled by the firing of the SETUP transition, characterized by infinite-server semantics. The model of the Spark-based job evidently has better utility for bioinformatics jobs because it reduces jobs by key, making more than one block available to execute new or pending jobs.
Figure 5: Petri Net Model of a MapReduce Job as described in Figure 2
It takes five places and five transitions for a MapReduce tool to perform a sequence similarity search task in a pipeline. In Figure 5, Halvade, a MapReduce-based analysis tool for read alignment and variant calling, calls its Map and Reduce functions. Its main tasks are split into subtasks depending on the availability of blocks and the number of users trying to access the service over the cloud. In the Map step, Halvade uses the BWA tool to align sequences to a reference genome; in the Reduce step, it searches chromosome regions with the GATK tool for variant calling. Iterative tasks, such as clustering, are not compatible with the MapReduce framework. To overcome this, Spark-based tools with cluster managers such as YARN have been employed for population-scale clustering. This is implemented in Falco, a Spark-based tool for single-cell RNA sequencing. Spark also supports deployment modes from local to standalone to cluster mode.
Figure 6: Model of a Spark based Job as described in Figure 4
Falco first aligns the target sequence with a sample of reference sequences, and also performs gene counting per sample as well as across samples from a given population of reference sequences. As Figure 6 shows, this is completed in six places and seven transitions. Spark's in-memory computing and shuffling followed by reduction by key are the primary reasons it handles clustering tasks efficiently, which makes big data analytics more compatible with Spark-based than with MapReduce-based applications.
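The chaining described above, counting genes per sample and then across samples, can be illustrated with a pure-Python stand-in for Spark's reduce-by-key transformation. The record layout, function names, and gene names here are all hypothetical, not Falco's actual API.

```python
def map_pairs(records, fn):
    # Stand-in for Spark's map: turn each record into a key/value pair.
    return [fn(r) for r in records]

def reduce_by_key(pairs, fn):
    # Stand-in for Spark's reduceByKey: combine values sharing a key.
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return sorted(acc.items())

# Each record: (sample, gene) for one aligned read.
reads = [("s1", "BRCA1"), ("s1", "TP53"), ("s2", "BRCA1"), ("s1", "BRCA1")]

# Per-sample gene counts: key on (sample, gene), then sum.
per_sample = reduce_by_key(map_pairs(reads, lambda r: (r, 1)),
                           lambda a, b: a + b)
print(per_sample)  # [(('s1', 'BRCA1'), 2), (('s1', 'TP53'), 1), (('s2', 'BRCA1'), 1)]

# Across samples: re-key on gene alone and sum again, chaining the steps.
across = reduce_by_key([(gene, n) for (sample, gene), n in per_sample],
                       lambda a, b: a + b)
print(across)  # [('BRCA1', 3), ('TP53', 1)]
```

Note how the second count is just another transformation chained onto the first, with no need to restate the job as fresh Map and Reduce tasks; in Spark the intermediate `per_sample` result would also stay in memory rather than being written to disk.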
Petri nets provide a mature formalism for visualizing big data analytics processes. Parallelized workflows are at the core of every big data analytics scheme, and bioinformatics pipelines are one such example. Hadoop is an implementation of the MapReduce paradigm for parallelized tasks, especially over the cloud. Petri nets are not only useful for graphical representation but also have the expressiveness to capture parallelism and synchronization. Map and Reduce functions decompose tasks into a series, which is not workable for clustering tasks; Spark, on the other hand, provides transformations that can be chained together without decomposing into Map and Reduce. Petri nets of Halvade and Falco have been created in this article to represent these workflows graphically and enhance understanding of their performance in terms of scalability.
About the Author
Tayyaba Iftikhar did her Masters in Software Engineering from COMSATS University Islamabad in 2020. Her interests are Application Design and Architecture, Global Value Chains and Trade in Services. She is a civil servant and currently working with Pakistan Institute of Trade and Development.