DNASTAR recently released a series of white papers that compare alignment and variant calling for next-gen sequence data in Lasergene Genomics Suite to some of our commercial and open source competitors. While these papers primarily focus on accuracy and computational speed, there is a third set of metrics that is more challenging to measure: What is the user experience like for the various assembly and analysis tools, and what is the calendar time required to get up and running with open source tools?
With open source software in particular, users have a steep learning curve to overcome. I recently sat down with DNASTAR Senior Scientist Dr. Eric Cabot to learn about some of the steps involved in using these tools. Dr. Cabot ran a bioinformatics core facility at the University of Wisconsin for several years, and he is very familiar with many of the challenges and pitfalls of next-gen sequencing and downstream analysis.
What are the largest obstacles that users face when analyzing next-gen sequencing data?
First, you need a lot of disk space, because you’re dealing with large amounts of data. And that often means you are going to spend a lot of time on the analysis.
In the case of open source tools, there are many different pieces of software and many steps involved; often you don’t know if you are using the latest version of the respective tools; each tool has its own quirks; and these tools are not necessarily fast. Putting together a workflow is a challenge because it requires a relatively high level of computer expertise.
So the biggest hurdles are the size of the data, the time and expertise involved in obtaining and using the software, and then finding a way to look at the results.
What steps are involved in setting up a pipeline using open source tools?
Step one is identifying the components you need. For example, you need to obtain an alignment program, such as BWA or Bowtie. You also need helper applications, such as Samtools, to help manage and analyze the data. Once you’ve identified the tools you need, you may have to compile and configure them yourself.
Before you can run the alignment, you have to index the genome. Then you align, or map, your reads, usually resulting in a large SAM file. Then you need to convert the SAM file to its binary counterpart, a BAM file. Once you have a BAM file, you need to sort and index it for downstream analysis. Finally, you have to improve your alignment, because there might be some gaps caused by insertions and deletions.
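To give a rough sense of what those steps look like in practice, here is a minimal Python sketch that assembles the corresponding BWA and Samtools commands. The file names (ref.fa, reads.fq, sample) and the helper names build_pipeline and run_pipeline are hypothetical and chosen only for this illustration. It assumes bwa and samtools are already installed and on your PATH, and it does not cover the final alignment-improvement step. By default it only lists the commands it would run.

```python
import shlex
import subprocess
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    description: str
    argv: List[str]
    stdout_path: Optional[str] = None  # file that captures stdout, if any


def build_pipeline(reference: str, reads: List[str], prefix: str) -> List[Step]:
    """Return the index/align/convert/sort/index steps (hypothetical helper)."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        Step("index the reference genome", ["bwa", "index", reference]),
        # bwa mem writes SAM to stdout, so we capture it into a file.
        Step("align reads to the reference", ["bwa", "mem", reference, *reads], sam),
        Step("convert SAM to BAM", ["samtools", "view", "-b", "-o", bam, sam]),
        Step("sort the BAM file", ["samtools", "sort", "-o", sorted_bam, bam]),
        Step("index the sorted BAM file", ["samtools", "index", sorted_bam]),
    ]


def run_pipeline(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Return shell-style command lines; execute them only if dry_run is False."""
    lines = []
    for step in steps:
        line = shlex.join(step.argv)
        if step.stdout_path:
            line += " > " + shlex.quote(step.stdout_path)
        lines.append(line)
        if dry_run:
            continue
        if step.stdout_path:
            with open(step.stdout_path, "w") as out:
                subprocess.run(step.argv, stdout=out, check=True)
        else:
            subprocess.run(step.argv, check=True)
    return lines


if __name__ == "__main__":
    for command in run_pipeline(build_pipeline("ref.fa", ["reads.fq"], "sample")):
        print(command)
```

Calling run_pipeline with dry_run=False would actually execute each tool in order and stop at the first failure. Even in this simplified form, each step has its own options to learn.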
All of these steps require learning and running a variety of tools with their appropriate command line options, which can take a lot of time. By contrast, with Lasergene and SeqMan NGen, all you need to do is install the software, download a genome template package, and then assemble your data.
What about variant analysis?
Some aligners are very stringent and throw away too much data. By erring on the side of caution, these aligners end up missing some true variants. By contrast, DNASTAR software uses minimal hard filters during alignment, and then allows you to quickly and easily apply additional, more stringent filters after the alignment is complete. These soft filters can be changed on the fly without re-running your alignment.
In addition, to make sense of your SNPs, you need to be able to distinguish known SNPs from novel SNPs. Many open source tools just provide a table of called variants and require additional scripting steps to compare the found SNPs to known SNPs. Lasergene has built-in filters to easily identify the SNPs of interest for a given project.
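As an illustration of the kind of extra scripting this can involve, here is a minimal Python sketch that splits called variants into known and novel by comparing them against a list of known SNPs. It assumes both inputs are tab-separated, VCF-style lines whose first five columns are chromosome, position, ID, reference allele, and alternate allele(s). The function names and the sample records are hypothetical, and a real project would likely need to handle more cases, such as differing chromosome naming or variant normalization.

```python
from typing import Iterable, List, Tuple

VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def parse_vcf_lines(lines: Iterable[str]) -> List[VariantKey]:
    """Extract one key per alternate allele from VCF-style text lines."""
    keys = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):
            keys.append((chrom, int(pos), ref, alt))
    return keys


def split_known_novel(
    called: List[VariantKey], known: Iterable[VariantKey]
) -> Tuple[List[VariantKey], List[VariantKey]]:
    """Return (known, novel) variants, preserving the order of the calls."""
    known_set = set(known)
    known_hits = [v for v in called if v in known_set]
    novel = [v for v in called if v not in known_set]
    return known_hits, novel


if __name__ == "__main__":
    # Made-up records, for illustration only.
    called_lines = [
        "#CHROM\tPOS\tID\tREF\tALT",
        "chr1\t100\t.\tA\tG",
        "chr1\t250\t.\tC\tT",
        "chr2\t75\t.\tG\tA,C",
    ]
    known_lines = [
        "chr1\t100\tsnp_a\tA\tG",
        "chr2\t75\tsnp_b\tG\tC",
    ]
    known_hits, novel = split_known_novel(
        parse_vcf_lines(called_lines), parse_vcf_lines(known_lines)
    )
    print("known:", known_hits)
    print("novel:", novel)
```

Matching on chromosome, position, and both alleles is one reasonable choice here; matching on position alone would flag more variants as known but could mislabel a different allele at the same site.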
How do you learn all the command line options in the various tools?
For most public software, you have to rely on forums for guidance. Sometimes there are additional resources online, but these are often not maintained. Many people are just running one or two projects, and they don’t want to invest a lot of time learning how to use the assembly and analysis tools.
How long does it take to set up and run a pipeline like this?
Each of the individual steps and commands can take anywhere from a few minutes to several hours. But the first time you’re trying to set up the open source pipeline and connect the various pieces of software, it is often a few weeks before you can actually begin to see and analyze results.
On the other hand, with Lasergene, you can set up your first NGS assembly or alignment project in just minutes and usually begin analyzing your results the same day.
Not ready to commit to learning open source tools? Request a free trial of Lasergene Genomics Suite to see how easy our software is to use.