In a previous post, we talked about sequencing as a big data problem, focusing on the scale of data generated by sequencing instruments. In this post, we'll look at the problem from the perspective of biology to understand the fundamental reasons why sequencing will always be a big data problem.
We'll derive two main conclusions, one for DNA applications and one for RNA applications:
- For DNA applications, noise and bias drive how much data must be collected
- For RNA applications, the distribution of transcripts drives the sampling depth
Let's look into the basic (and admittedly simplified) math behind these two important conclusions.
The goal of most DNA sequencing experiments is to identify locations where a sample differs from a reference genome. These differences can be small variants (e.g., SNPs and indels), translocations, copy number variations, or differences in the way the bases have been modified by external factors (e.g., methylation). In all cases, each location on the genome must be measured repeatedly until the base actually present at that location can be called with confidence.
A number of factors can lead to different measurements at the same location, including:
- In polyploid organisms, two or more distinct bases are possible at each location
- Basic mutation rates ensure that over time individual bases will diverge as cells evolve
- Instruments and prep methods can induce read errors or biases
Each of these factors requires oversampling at a particular location to call the base (or bases, for polyploid genomes) correctly. Most variant calling methods require 10-30x coverage at a given location to correctly call a base.
For a small genome such as E. coli, with roughly 5 million bases, 10-30x coverage means measuring 50-150 million bases, which corresponds to 125-375 MB of sequence data (quality scores and metadata increase the volume by about 2.5x relative to the number of bases measured). For larger genomes, such as the human genome with 3 billion bases, 75-225 GB of sequence data must be collected.
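The arithmetic above can be sketched as a simple calculation: bases in the genome, times coverage, times a per-base storage overhead factor. The 2.5x overhead factor comes from the estimate above for quality scores and metadata; the function name is just for illustration.

```python
def sequence_data_bytes(genome_size, coverage, overhead=2.5):
    """Rough data volume for a DNA sequencing run.

    genome_size: number of bases in the genome
    coverage:    average number of times each base is measured
    overhead:    storage bytes per base measured (quality scores,
                 metadata, etc. -- ~2.5x the raw base count)
    """
    return genome_size * coverage * overhead

# E. coli (~5M bases) at 10-30x coverage
print(sequence_data_bytes(5e6, 10))   # 125 MB
print(sequence_data_bytes(5e6, 30))   # 375 MB

# Human (~3B bases) at 10-30x coverage
print(sequence_data_bytes(3e9, 10))   # 75 GB
print(sequence_data_bytes(3e9, 30))   # 225 GB
```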
Expression levels are one of the primary measurements made in RNA experiments. Expression levels measure 1) if a transcript is expressed at all and 2) what the relative proportion of that transcript is to the other actively expressed transcripts in the sample. Expression measurements are made by counting the number of reads that map to a given transcript. Highly expressed transcripts will have more reads than rarely expressed transcripts.
Each cell contains a number of active transcripts that represent the current genomic activity in the cell. Active transcripts vary by cell type and the cell's stage in its lifecycle. Copy numbers for transcripts within a cell vary greatly - a few highly expressed genes account for the bulk of the active transcripts in a cell.
Let's consider an RNA preparation from a collection of cells with 10,000 active genes, in which ribosomal RNA has been removed and only transcripts with poly(A) tails are included. The top 20 expressors will likely account for the bulk of the active transcript copies, with the top 10 alone accounting for 95% of them.
Given 1M reads, 950k reads will map to the top 10 expressors, leaving 50k reads for the remaining transcripts. Of those 50k, 95% (47,500 reads) will likely map to the next 10 top expressors, leaving 2,500 reads for the remaining 9,980 transcripts. Assuming those transcripts are present in similar copy numbers, there's only a 1 in 4 chance of seeing even a single read from any one of them.
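The read allocation above works out like this (the 95% splits are the illustrative figures from the example, not measured values):

```python
# Back-of-the-envelope read allocation for 1M reads over 10,000
# active transcripts, using the illustrative 95% splits above.
total_reads = 1_000_000
top_10 = total_reads * 95 // 100        # 950,000 reads to the top 10 expressors
remaining = total_reads - top_10        # 50,000 reads left over
next_10 = remaining * 95 // 100         # 47,500 reads to the next 10 expressors
leftover = remaining - next_10          # 2,500 reads for everything else
rare_transcripts = 10_000 - 20          # 9,980 remaining transcripts
reads_per_rare = leftover / rare_transcripts
print(reads_per_rare)                   # ~0.25: a 1-in-4 chance per transcript
```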
Increasing the number of reads to 5M makes it likely that most transcripts will be counted at least once. However, more than one read per transcript is required to confirm detection and to provide enough counts for differential expression analysis.
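One way to make "likely to be counted at least once" concrete is to model sampling of a rare transcript as a Poisson process. This is a simplifying assumption (uniform sampling, independent reads) layered on the example's illustrative 95% splits, not a claim about any particular pipeline:

```python
import math

def detection_probability(total_reads, rare_pool_fraction=0.05 * 0.05,
                          n_rare_transcripts=9_980):
    """P(>= 1 read) for one rare transcript, under a Poisson model.

    rare_pool_fraction: share of reads left after the top 20 expressors
                        (5% of 5%, per the example above)
    """
    mean_reads = total_reads * rare_pool_fraction / n_rare_transcripts
    return 1 - math.exp(-mean_reads)

print(detection_probability(1_000_000))  # ~0.22
print(detection_probability(5_000_000))  # ~0.71
```

At 5M reads each rare transcript expects ~1.25 reads, so roughly 7 in 10 are seen at least once; detection "starts" to be likely but is far from guaranteed.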
For RNA experiments, this sets a lower bound on the amount of data that must be collected. If the transcripts of interest are highly expressed, then only a few million reads are necessary. But, for rare transcripts, tens of millions of reads are needed.
RNA-Seq is based on read counting instead of coverage: the goal is to assign read counts to regions of the genome rather than to ensure even coverage across the whole genome. For counting experiments, shorter reads can be used, reducing the amount of data and instrument run time; 35-bp reads are generally considered sufficient for counting applications. (As an aside: longer reads and full coverage of transcripts are helpful for identifying transcripts and splice events, but for straight counting they are not necessary.)
Back to our discussion of data scale: for studies that target highly expressed genes, where 5M 35-bp reads are sufficient, about 450 MB of data will be collected per replicate. For studies that target transcripts with fewer copies in a given sample, up to 50M 35-bp reads may be required, or around 4.5 GB of data per replicate. RNA experiments compare multiple samples with multiple replicates, quickly pushing the data scale for a single experiment into the hundreds of GBs.
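The same volume calculation applies here, with reads times read length standing in for genome size times coverage. The sample and replicate counts below are illustrative, chosen only to show how quickly an experiment grows:

```python
def rna_data_bytes(reads, read_length=35, overhead=2.5):
    """Rough data volume for an RNA-Seq replicate: bases read times
    per-base storage overhead (quality scores, metadata)."""
    return reads * read_length * overhead

print(rna_data_bytes(5_000_000))    # ~437.5 MB per replicate (~450 MB)
print(rna_data_bytes(50_000_000))   # ~4.4 GB per replicate

# Hypothetical experiment: 8 samples x 3 replicates of deep sequencing
print(8 * 3 * rna_data_bytes(50_000_000))  # ~105 GB for one experiment
```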
Biology Drives Data Scale
As these two examples show, basic tenets of biology and measurement systems determine the amount of data needed to answer specific questions using sequencing. Without NGS instruments, these types of measurements would not be practical.
As a stopgap to reduce data scales, prep methods have been developed that target specific regions of the genome. Exome sequencing covers only the known coding regions of the genome. Gene panels include probes for transcripts of interest, along with a few reference transcripts, to allow for targeted differential expression studies. In a future post, we'll look at how these methods trade depth for specific data and discuss whether or not they offer an improvement over microarrays.