Answers to your “Genomic Data Assembly Strategies” webinar questions
We recently presented a webinar entitled “Choosing the best assembly strategy for your genomic sequencing data.”
You can view the recording of this webinar and others that we’ve presented on our Webinars page.
The software demo took almost the full hour, so presenter Matt Keyser only had a few minutes in which to answer your questions. As with other recent webinars, we therefore decided to turn a selection of your questions into the following “Q & A” blog post.
Using SeqMan NGen for assembly
- How do you check quality of reads before loading?
SeqMan NGen can check read quality and discard poor-quality reads automatically. In the Preassembly Options wizard screen, check the box next to Quality end trim.
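For readers curious about what trimming read ends by quality involves in general, here is a minimal Python sketch. It is not SeqMan NGen's implementation: the function names, the Phred+33 encoding, and the cutoff of Q20 are all assumptions chosen for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality_string]


def quality_end_trim(sequence, quality_string, min_quality=20):
    """Remove bases below min_quality from both ends of a read.

    Hypothetical helper for illustration only; the Q20 cutoff is an
    assumed value, not a SeqMan NGen default.
    """
    scores = phred_scores(quality_string)
    start, end = 0, len(sequence)
    # Walk inward from the 5' end while bases are below the cutoff.
    while start < end and scores[start] < min_quality:
        start += 1
    # Walk inward from the 3' end while bases are below the cutoff.
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return sequence[start:end], quality_string[start:end]


# '!' encodes Q0 and 'I' encodes Q40 in Phred+33.
trimmed_seq, trimmed_qual = quality_end_trim("ACGTACGT", "!!IIII!!")
print(trimmed_seq)  # GTAC
```

A read whose bases are all below the cutoff comes back empty, which a pipeline would then typically discard.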
- Are assembly parameters different for genomic data vs. metagenomic data?
Yes. The selection you make in SeqMan NGen’s first wizard screen (“Workflow”) determines what you see on all subsequent wizard screens, including options and preset default values. In many cases, you can keep the default settings and simply upload your data files when prompted. For times when you do want to customize settings, our comprehensive SeqMan NGen User Guide describes each screen and setting option in detail.
- How can I find SNPs using my sequencing data with SeqMan NGen? How can I find which of these SNPs might cause health issues?
To direct SeqMan NGen to find SNPs automatically during assembly, launch SeqMan NGen and choose New Project. On the first screen (“Workflows”), choose any workflow from the Variant analysis/resequencing workflows tab. When the assembly finishes, you can open it in ArrayStar, SeqMan Ultra and/or SeqMan Pro to analyze the SNPs found during assembly.
Regarding the “health issues” question: If you are working with human data, choose a variant workflow as directed above. In the Analysis Options screen, also check the Import Variant Annotation Database box. If you use ArrayStar to analyze the finished assembly, you will be able to use a large variety of databases like ClinVar, MutationTaster, and dbSNP to search for SNPs that are known to be associated with health issues. Click here for a tutorial that covers this workflow from setup through downstream analysis.
- Are there particular settings I should use for an AT or GC rich genome?
You can use the same assembly settings for either AT or GC rich genomes.
- How does SeqMan NGen handle repetitive regions in bacterial and fungal genomes that use NGS data?
SeqMan NGen attempts to align sequence data through the repetitive regions, using paired data if available. Long read data often extends through repetitive regions, resulting in a less fragmented draft genome. However, if the repetitive regions are longer than the sequence reads, these regions will not assemble into a single contig.
- My genome has multiple copies of some genes. How would I set up a short read assembly?
It might be beneficial to increase the default minimum match stringency from “93” to “95” or “97” so that the assembler can separate the reads belonging to each gene copy.
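To see why a higher stringency helps, consider a simplified Python sketch in which two gene copies differ at 5 of 100 positions. The ungapped comparison and the function names below are assumptions for illustration; the assembler's actual definition of a match may differ.

```python
def percent_match(read, target):
    """Percentage of identical positions in an ungapped, equal-length comparison."""
    if len(read) != len(target):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(read, target))
    return 100.0 * matches / len(read)


def passes_threshold(read, target, min_match_percent):
    """True if the read matches the target at or above the threshold."""
    return percent_match(read, target) >= min_match_percent


# Hypothetical 100 bp gene copies that differ at 5 positions.
copy_a = "ACGT" * 25
copy_b = list(copy_a)
for position in (0, 20, 40, 60, 80):
    copy_b[position] = "T"  # each of these positions is 'A' in copy_a
copy_b = "".join(copy_b)

read = copy_a  # a read that truly originates from copy A

print(percent_match(read, copy_b))         # 95.0
print(passes_threshold(read, copy_b, 93))  # True: the read could join either copy
print(passes_threshold(read, copy_b, 97))  # False: the read is kept apart from copy B
```

At a threshold of 93 the read is compatible with both copies, so the copies may collapse together; at 97 only the true source copy qualifies.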
- Can I download a viral database and use the database as a reference for templated assembly?
Yes. However, note that most viral databases contain many very similar sequences, so you may get better results using the reference guided metagenomic workflow that can account for reads mapping to multiple template sequences.
- Using SeqMan NGen and NGS data, how would I locate sequence that is highly mutated in the SARS2 virus?
There are two approaches: 1) align your data to all known SARS2 reference genomes using the reference guided metagenomic workflow, or 2) de novo assemble your data using either the genome or metagenomic workflow.
- Is there any option for hybrid assembly (long and short read) using Unicycler?
No. Instead, DNASTAR provides a hybrid assembly workflow that uses our own assembly algorithms.
- What are the current nonsynonymous SNP prediction tools included with SeqMan NGen 17.1?
The SeqMan NGen v17.1 SNP prediction tools are based on the MAQ variant calling algorithm and use an annotated reference genome to make nonsynonymous variant calls.
Miscellaneous SeqMan NGen-related questions
- What kind of computer hardware do I need for a SeqMan NGen assembly?
For Lasergene Genomics hardware requirements, see our Technical Requirements page.
Generally, for de novo assemblies and Illumina data, RAM is the limiting factor. A very deep assembly can double or triple the amount of RAM needed. The SeqMan NGen wizard allows you to decrease depth to reduce memory use during assembly. If you plan to do frequent de novo assemblies, however, we recommend using a machine with 32 GB of RAM or more.
Currently, long-read assembly in SeqMan NGen is limited to microbial genomes. We are actively working to increase the capacity to small eukaryotic genomes and beyond.
- How would I annotate genome sequencing data?
DNASTAR’s Genomic Services can assist with annotation, assembly, or both.
- Can I use SeqMan NGen offline?
SeqMan NGen is a locally-installed application and the setup takes place on your local computer. Once you are ready to submit the job for assembly, the SeqMan NGen setup wizard considers your data size and computer hardware and then recommends whether to run locally or on the cloud. The latter requires one or more DNASTAR Cloud Assembly licenses. These can easily be purchased online through our Academic/Commercial pricing pages.
As a general rule, Sanger data and bacterial genomes can usually be assembled locally, while large eukaryotic assemblies will likely require Cloud Assembly to assemble successfully.
- Any tips for assembling large data sets?
Definitely! Here are two things you can try.
– In SeqMan NGen’s “Assembly Options” screen, lower the value for Maximum total reads. Click here for a tutorial that uses this strategy for whole-genome assembly.
– If the method above is not enough, SeqMan NGen provides the option of running your assembly on the cloud. DNASTAR Cloud Assembly licenses can easily be purchased online through our Academic/Commercial pricing pages.
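When deciding how far to lower the read count, it can help to estimate coverage depth first. The Python sketch below uses the standard approximation depth = reads × read length / genome size. The function names and the 50x target are illustrative assumptions, not DNASTAR recommendations.

```python
import math


def estimated_depth(num_reads, mean_read_length, genome_size):
    """Approximate average coverage depth: total sequenced bases / genome size."""
    return num_reads * mean_read_length / genome_size


def reads_for_target_depth(target_depth, mean_read_length, genome_size):
    """Number of reads needed to reach roughly the target depth."""
    return math.ceil(target_depth * genome_size / mean_read_length)


# Hypothetical data set: 10 million 150 bp reads for a 5 Mb bacterial genome.
genome_size = 5_000_000
print(estimated_depth(10_000_000, 150, genome_size))  # 300.0
# Reads needed for an assumed target of about 50x coverage.
print(reads_for_target_depth(50, 150, genome_size))   # 1666667
```

In this example, most of the reads add depth, and therefore memory use, without adding new sequence.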
- What types of read technology does SeqMan NGen support?
SeqMan NGen supports Sanger/trace, Illumina, Ion Torrent, Nanopore and PacBio sequence technologies in over a dozen read formats and over two dozen read file extensions. For a complete list, please see the “SeqMan NGen” table midway down the File Formats page on our website.
Using other Lasergene Genomics apps for downstream analysis
- Which Lasergene applications are used for downstream analysis?
After assembling genomic data with SeqMan NGen, you can use SeqMan Ultra, SeqMan Pro, ArrayStar and/or GenVision Pro for downstream analysis. For examples of each, read through or try out some of our SeqMan NGen tutorials.
- After assembling my genome, how can I use Lasergene applications to compare it to other closely related genomes?
There are a couple of options: 1) You can use your assembled genome as a reference to compare sequence data (fastq) from other closely related genomes, or 2) if the other genomes are also assembled, you can use the Mauve genome aligner in MegAlign Pro to align and compare them.
- In the case of a heterozygote, half of the sequences would align perfectly, and half would be extra. Is this detectable?
Yes, heterozygous variations are detectable. In some cases, the reads from both alleles align to the common reference. In other cases, a deletion or insertion can be detected in about 50% of the reads.
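The idea that a heterozygous variant appears in roughly half of the reads can be illustrated with a short Python sketch. It assumes a diploid sample, and the function name, fraction cutoffs and minimum depth are hypothetical values for illustration, not the criteria SeqMan NGen uses.

```python
def classify_genotype(ref_count, alt_count, het_low=0.25, het_high=0.75, min_depth=10):
    """Classify a site from the number of reads supporting each allele.

    Assumes a diploid sample; all thresholds are illustrative assumptions.
    """
    depth = ref_count + alt_count
    if depth < min_depth:
        return "low coverage"
    alt_fraction = alt_count / depth
    if alt_fraction < het_low:
        return "homozygous reference"
    if alt_fraction > het_high:
        return "homozygous variant"
    return "heterozygous"


print(classify_genotype(24, 26))  # heterozygous
print(classify_genotype(1, 49))   # homozygous variant
print(classify_genotype(3, 2))    # low coverage
```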
- Can a reference-guided genome assembly detect the addition or deletion of a gene cluster (~50kb)?
Regarding the addition, SeqMan NGen does not specifically flag this. However, you will see the following when you perform downstream analysis in SeqMan Ultra or SeqMan Pro:
– A break in the alignment that results in > 1 contig being produced.
– One or more reads appearing in the “Unassembled Sequences” list.
To identify elements in the sequenced strain that did not match the reference, note which reads appeared in the list and then de novo assemble them using SeqMan Pro.
A reference-guided assembly “Variant detection workflow” produces a Structural Variation report that can detect a 50 kb insertion or deletion.
– Deletions – can be visualized in the Strategy view as a stack of split reads (pink reads in Strategy view) flanking the deletion site and (typically) a drop in alignment coverage in the deleted region.
– Insertions – are detected by trimmed sequences flanking the insertion site, but size cannot be accurately determined without proceeding to genome finishing steps.
- What can I do to increase the Contig N50 value in my finished assembly?
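When you open a finished assembly in SeqMan Ultra, the Project Report shows the Contig N50 value. N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases, so it is less sensitive to many small contigs than a simple average. See this tutorial for one method of increasing the Contig N50.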
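For reference, the N50 statistic is conventionally computed by sorting contigs from longest to shortest and finding the length at which the running total reaches half of all assembled bases. The Python sketch below follows that common definition; the function name and example lengths are hypothetical, and this is not SeqMan Ultra's code.

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths (common definition)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Accumulate contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


fragmented = [100, 200, 300, 400, 500]
print(contig_n50(fragmented))  # 400

# Joining contigs raises N50 even though the total length is unchanged.
merged = [500, 1000]
print(contig_n50(merged))      # 1000
```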
- If sequences assembled all together we would only get one contig. Why, then, do your example assemblies end with multiple contigs?
In an ideal world, both de novo and reference-guided assemblies would result in a single contig. De novo assembly is limited by repetitive elements in the genome. If the largest repetitive element is longer than the sequence reads, this region cannot be de novo assembled (accurately) and will result in multiple contigs. Poor data, thin coverage and contaminants can also contribute to a more fragmented genome.
Questions related to other Lasergene applications
- I have 7 genes (cds) of mung bean yellow mosaic India virus. Two genes are virion-sense and five genes are complementary-sense. How would I assemble these together?
For the workflow you describe, you would not do an assembly in SeqMan NGen. Instead, you would use MegAlign Pro, our application for multiple sequence alignment using algorithms like Clustal Omega, MUSCLE, and MAFFT. MegAlign Pro allows you to reverse complement selected sequences before you do the alignment.
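As background, reverse complementing puts a complementary-sense sequence into the same orientation as the virion-sense sequences. The Python sketch below shows the operation itself; it is a generic illustration with a hypothetical function name, not how MegAlign Pro implements it.

```python
# Translation table for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGGCC"))  # GGCCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check.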