This is a common question NGS users must answer before attempting de novo genome assembly, and there often isn’t a single correct answer. Underestimating memory requirements can easily lead to very slow or stalled assemblies. However, if we make some assumptions about data type and desired coverage, we can arrive at a reasonable starting point.
Most stringent de novo genome assemblers, like SeqMan NGen, require adequate random-access memory (RAM) to cluster sequence data into contigs. Several factors contribute to memory usage: some are obvious, like the number of reads, while others are more difficult to estimate, like genome complexity. The most important factors include:
- Number of reads
- Genome size
- Read length
- Genome complexity
- Read accuracy
NGS users generally have access to paired-end Illumina data, at least 100 bp in length, with at least 50X coverage across the genome.
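Coverage ties these assumptions together: coverage = (number of reads × read length) / genome size. The sketch below inverts that relationship to estimate how many reads a 50X experiment needs; the genome size used is a rough illustrative figure, not taken from the data sets described here.

```python
# Estimate the number of reads needed to hit a target coverage.
# coverage = (reads * read_length) / genome_size, so
# reads = coverage * genome_size / read_length.

def reads_for_coverage(genome_size_bp, read_length_bp, target_coverage):
    """Return the approximate read count for the requested coverage."""
    return int(target_coverage * genome_size_bp / read_length_bp)

# Illustrative example: an E. coli-sized genome (~4.6 Mb),
# 100 bp reads, 50X target coverage.
reads = reads_for_coverage(4_600_000, 100, 50)
print(reads)  # prints 2300000 (i.e., 1,150,000 read pairs for 2 x 100 data)
```

The same arithmetic works in reverse to check what coverage an existing data set provides before deciding whether to subsample it.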
Based on these assumptions, we downloaded four Illumina 2 × 100 bp paired-end data sets from the Sequence Read Archive for four organisms: E. coli, S. cerevisiae, N. crassa, and C. elegans, and used SeqMan NGen to restrict the input reads to 50X coverage prior to assembly. RAM usage was monitored during assembly so that peak maximum committed memory (physical RAM plus disk paging files) could be recorded.
The graph plots memory usage against genome size and reveals a rough rule of thumb: when using “typical” Illumina data at “typical” coverages, roughly 1 GB of RAM is required for every 1 Mbase of genome, a more linear relationship than expected.
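The rule of thumb above can be turned into a quick estimator. This is a minimal sketch, assuming the ~1 GB per 1 Mbase relationship holds for similar data types and coverages; the genome sizes in the example are approximate and purely illustrative.

```python
# Rough peak-RAM estimate from the "1 GB per 1 Mbase" rule of thumb,
# assuming typical paired-end Illumina data at ~50X coverage.

def estimated_ram_gb(genome_size_bp):
    """Estimated peak committed memory in GB: ~1 GB per 1 Mbase of genome."""
    return genome_size_bp / 1_000_000

# Approximate genome sizes (illustrative values):
for name, size_bp in [("E. coli", 4.6e6),
                      ("S. cerevisiae", 12e6),
                      ("C. elegans", 100e6)]:
    print(f"{name}: ~{estimated_ram_gb(size_bp):.0f} GB RAM")
```

Treat the result as a lower bound for planning: repetitive genomes, longer reads, or deeper coverage can all push actual usage higher.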