Informatics 101: Sequencing is a BIG Data Problem

In the early days of next-generation sequencing, I was approached by a local R&D group about designing a cluster for their two new SOLiD instruments.  The budget was modest - tens of thousands of dollars - and the expectations were high - a run every ten days for each instrument, or one run a week for the lab.  Some quick back-of-the-envelope calculations yielded disappointing results: with the proposed budget, the best solution would be a nice workstation that would take at least 10 days to process a single run.  Even if everything worked perfectly, it simply wouldn't be possible to keep up with the instrument output.

To reset expectations and develop a more appropriate solution, we took a step back and looked at the problem from a systems perspective: how does data move from the instrument through analysis?  What are the main bottlenecks? What else will the cluster be used for? What solutions exist that will allow us to meet our development and production goals?

In the next few blog posts, we'll go through the basic analysis we performed.  Along the way, we'll identify the main components of a sequencing informatics system, discuss the parameters that matter the most when planning a system, and set guidelines for designing a sequencing informatics system to meet your specific requirements.

A Model NGS Pipeline

To start, let's consider the basic steps in a bioinformatics workflow:

  1. Translate raw data into short reads
  2. Map reads to the reference genome
  3. Protocol specific analysis (e.g., expression analysis for RNA-Seq, variant calling for re-sequencing)
  4. Report results

At each step, data are transformed and prepared for the next stage. In between stages, data may be moved between different compute resources. These steps are common enough across sequencing protocols that steps 1-3 are often simply referred to as Primary, Secondary, and Tertiary analysis.
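
To make the flow concrete, here is a schematic sketch of the four stages in Python.  The function names, file names, and reference are placeholders for illustration, not a real pipeline; the actual tools vary by platform and protocol.

    # Schematic sketch of the four-stage workflow described above.
    # All names below are illustrative placeholders, not real tools or paths.

    def primary_analysis(raw_run_dir):
        """Translate raw instrument output into short reads (e.g., FASTQ)."""
        # Typically runs on or near the instrument itself.
        return "run.fastq"

    def secondary_analysis(reads, reference):
        """Map reads against the reference genome, producing alignments."""
        return "run.bam"

    def tertiary_analysis(alignments, protocol):
        """Protocol-specific analysis, e.g., read counts or variant calls."""
        return "results.txt"

    def report(results):
        """Summarize results for the end user."""
        print("report generated from", results)

    if __name__ == "__main__":
        reads = primary_analysis("/data/run_001")           # Primary
        alignments = secondary_analysis(reads, "hg19")      # Secondary
        results = tertiary_analysis(alignments, "RNA-Seq")  # Tertiary
        report(results)                                     # Reporting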

We'll use this basic model to help reason about our sequencing informatics system.

Sequencing as a Big Data Problem

[Figure: Common sequencing informatics workflow steps]

What sets next-generation sequencing apart from many other data collection methods is the sheer amount of data generated by each run. In fact, aside from a handful of large experiments such as the Hubble Space Telescope and the Large Hadron Collider, few scientific instruments generate as much data as NGS instruments.

From a pure operations perspective, the amount of data at each analysis stage is driven by the output of primary analysis, which generally occurs on the instrument.  A HiSeq run can generate anywhere from a few hundred gigabytes (GB) to a few terabytes (TB) of read data.  Our SOLiD instruments originally generated around 40 GB per run but quickly grew, through improvements in chemistry, to generate over 300 GB.

Secondary analysis maps all the reads against a reference genome and generates a file with one or more entries per read, identifying the locations where each read mapped.  The data stored for each mapped read takes roughly the same amount of space as the read itself, essentially doubling the data size for the run.  Binary compressed formats such as BAM are often used for mapped data, reducing the storage requirements.
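
As a rough back-of-the-envelope illustration of why mapping roughly doubles the footprint, compare per-read storage before and after alignment.  The byte counts below are assumptions for illustration only, not format specifications.

    # Rough per-read storage, before and after mapping (illustrative numbers).
    read_length = 100                        # bases per read
    fastq_per_read = 2 * read_length + 50    # bases + qualities + identifier
    aln_overhead = 80                        # assumed bytes of alignment fields (position, CIGAR, flags, tags)
    aln_per_read = 2 * read_length + aln_overhead

    reads = 3_000_000_000                    # a large HiSeq-scale run
    print(f"FASTQ: ~{fastq_per_read * reads / 1e9:.0f} GB")                  # ~750 GB
    print(f"Uncompressed alignments: ~{aln_per_read * reads / 1e9:.0f} GB")  # ~840 GB
    # Keeping both roughly doubles the run's footprint; binary compressed
    # formats bring the mapped data back down.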

Tertiary analysis further processes the mapped reads based on the specific protocol.  Results for tertiary analysis are often reported from the perspective of the reference rather than the reads and are much smaller.  For example, RNA-Seq results report how many reads mapped to each gene or transcript.  Even with 80-100k transcripts, basic read count reports are small in comparison to the actual read data.

While it's tempting to focus on the results of tertiary analysis and disregard the data from the other stages, any sequencing informatics system must take into account all working data and the actual, not just ideal, usage patterns.  Secondary analysis is often repeated using different aligners and parameters to help validate results. Pipelines fail and need to be restarted. Bioinformaticians replicate past results to develop new methods. Data retention policies may require reads and other intermediate results to be stored for a period.  As a rule of thumb, any sequencing run will require around 2.5 times the size of the FASTQ file in on-line storage, and the data will need to remain available for at least the duration of the project.
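
For planning purposes, the 2.5x rule of thumb is easy to turn into a storage estimate.  Here is a minimal sketch; the run size and run count are placeholders you would replace with your own numbers.

    # On-line storage estimate using the ~2.5x rule of thumb above.
    WORKING_MULTIPLIER = 2.5   # reads + mapped data + reruns and other intermediates

    def online_storage_gb(fastq_gb_per_run, runs):
        """Rough on-line storage needed to keep a project's working data available."""
        return fastq_gb_per_run * WORKING_MULTIPLIER * runs

    # Example: a year of weekly 300 GB runs (roughly our later SOLiD output).
    print(f"{online_storage_gb(300, 52):,.0f} GB")   # ~39,000 GB, i.e. ~39 TB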

Instrument      Read Length   Paired End   Reads (millions)   FASTQ (GB)   Total Data (GB)
HiSeq 2500      36            No           3,000              366          915
HiSeq 2500      100           Yes          6,000              600          1,500
MiSeq           36            No           15                 1.5          3.75
MiSeq           100           Yes          30                 3            7.5
MiSeq           250           Yes          30                 33           82.5
SOLiD 4         110           75+35        1,400              378          945
ION PGM 318     35            No           8                  0.96         2.4
ION PGM 318     200           No           8                  3.6          9
ION PGM 318     400           No           8                  6.8          17
ION Proton 1    200           No           80                 36           90
ION Proton 2    200           No           160                72           180

The table summarizes the basic data scales for different types of instruments.  FASTQ file sizes are computed as (read length x 2 + 50) bytes per read times the number of reads, where the factor of two accounts for the bases and their quality values and 50 bytes are added for the identifier (paired-end runs have an additional 2x multiplier).  Total data applies the 2.5x multiplier to the FASTQ size.  Read lengths and read counts are taken from the product websites, except for Proton 2, which is just a guess at a future Proton.
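
The sizing formula is simple enough to check directly.  Here is a minimal sketch of the table's arithmetic, with reads given in millions so the result comes out in (decimal) GB:

    # The table's sizing formula: (read length x 2 + 50) bytes per read,
    # an extra 2x for paired-end runs, and a 2.5x working-data multiplier.

    def fastq_gb(read_length, reads_millions, paired=False):
        bytes_per_read = read_length * 2 + 50            # bases + qualities + identifier
        ends = 2 if paired else 1
        return bytes_per_read * reads_millions * ends / 1000.0   # MB -> GB

    def total_gb(read_length, reads_millions, paired=False):
        return 2.5 * fastq_gb(read_length, reads_millions, paired)

    # Spot checks against a couple of table rows:
    print(fastq_gb(36, 3000))        # HiSeq 2500, 36 bp single end -> 366.0 GB
    print(total_gb(36, 3000))        # -> 915.0 GB
    print(fastq_gb(250, 30, True))   # MiSeq, 250 bp paired end -> 33.0 GB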

Given these results, it might seem prudent to simply use ION Torrent PGMs or MiSeqs and sidestep the big data problem entirely.  Of course, that ignores another important factor in sequencing: basic biology drives how much data must be collected for a given experiment.  In the next blog post, we'll explore how biology itself is a big data problem.
