How much disk space do I need for my templated genome assembly?
In a previous post we described the RAM requirements for de novo genome assemblies, which scale linearly with genome size. Extrapolating to large eukaryotic genomes suggests RAM requirements measured in terabytes, but de novo assembly is often not the goal. Instead, you may want to align large numbers of reads against a known reference genome to detect variation relative to that reference. For such applications, Lasergene Genomics Suite and SeqMan NGen use DNASTAR’s patented disk sort alignment (DSA) algorithm, which greatly reduces RAM requirements. When using DSA, the software writes the temporary files required during assembly to disk. Drive speed affects assembly time, so for maximum efficiency we recommend one drive for the input data and result files and a separate drive for the temporary files.
The question then arises as to how much disk space a templated assembly needs. Many of the factors that affect RAM usage for de novo assembly also influence the disk space requirements for a templated assembly: genome size and complexity, number of reads, read length, and read accuracy. The choice of reference sequence matters as well. An inappropriate reference increases the potential for misalignment and can eliminate important sequences that cannot be aligned with confidence.
We collected enough Illumina 2×100 paired-end data sets from the Sequence Read Archive (SRA) to provide approximately 40x coverage for a range of genome sizes, from E. coli to H. sapiens. The data were assembled against the corresponding reference genomes using SeqMan NGen while we monitored the disk space used during the process. The measured disk space excludes input data (reads and reference genome) but includes both temporary files and final result files. For non-microbial organisms, the graph suggests a rule of thumb of about 0.5 to 0.7 GB of disk space per Mb of genome length. As shown in the graph, a human genome can be aligned against a reference genome using SeqMan NGen on a computer with as little as 2 terabytes of hard disk space available.
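To make the rule of thumb concrete, the short Python sketch below multiplies a genome length in Mb by the 0.5 and 0.7 GB per Mb bounds quoted above. It is only an illustration of the arithmetic, not part of SeqMan NGen. The function name `estimate_disk_gb` is hypothetical, the genome lengths are rough approximations supplied for the example, and the estimate assumes conditions similar to the benchmark (about 40x coverage of Illumina 2×100 paired-end reads, non-microbial genome).

```python
# Rule-of-thumb bounds from the benchmark: GB of disk space per Mb of genome.
LOW_GB_PER_MB = 0.5
HIGH_GB_PER_MB = 0.7


def estimate_disk_gb(genome_mb):
    """Return (low, high) estimated disk space in GB for a templated assembly.

    Covers temporary and result files only, not the input reads or the
    reference genome. Hypothetical helper for illustration.
    """
    if genome_mb <= 0:
        raise ValueError("genome_mb must be positive")
    return genome_mb * LOW_GB_PER_MB, genome_mb * HIGH_GB_PER_MB


if __name__ == "__main__":
    # Approximate genome lengths in Mb (assumed values, for illustration only).
    examples = {
        "C. elegans": 100,
        "H. sapiens": 3100,
    }
    for organism, genome_mb in examples.items():
        low, high = estimate_disk_gb(genome_mb)
        print(f"{organism}: about {low:,.0f} to {high:,.0f} GB")
```

For a human genome of roughly 3,100 Mb, this gives an estimate of about 1.5 to 2.2 TB, in line with the 2 terabyte figure above. Different coverage depths, read lengths, or read accuracy would shift the result.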
Learn more about reference-guided genome assembly in Lasergene and see benchmarks for various assembly times here.