De novo genome assembling and editing workflows - User Guide to SeqMan NGen - 17.6

The following table describes each of the workflows available in the De novo genome assembly and editing tab of the Workflow screen.

Group	Workflow	Description
ABI / Sanger	De novo assembly	Fast, accurate trimming and assembly of Sanger trace data, creating a project file that can be edited in SeqMan Ultra or SeqMan Pro. This non-templated assembly supports up to 30 million sequence reads and up to 50 Mbase total length for all contigs combined; the actual capacity is determined by the amount of available RAM. When assembling a data set de novo, we recommend using paired-end data, if available.
ABI / Sanger	Genome finishing – refinement	Align Sanger data to a draft sequence for further refinement of small errors. (Note: Use the Variant Analysis/Resequencing workflow if your primary intent is SNP analysis.) This workflow is most frequently used for extending off the ends of saved contig consensus sequences and correcting small errors within the contigs. This type of assembly can include up to 10 million reads and up to a 100 Mbase genome. It can be edited at a later time using a utility like SeqMan Ultra or SeqMan Pro.
NGS-based	De novo assembly	Assembly of Sanger, Illumina and long-read sequencing data that produces a file that can be edited with SeqMan Ultra or SeqMan Pro. This non-templated assembly supports up to 30 million sequence reads and up to 50 Mbase total length for all contigs combined; the actual capacity is determined by the amount of available RAM. When assembling a data set de novo, we recommend using paired-end data, if available.
	Genome finishing – initial error correction	Align NGS data to a draft genome or contigs to correct large misalignments and smaller errors. With minimal user intervention, this option uses both reference-guided and de novo assembly steps to resolve single nucleotide and small multibase replacements (indels) as well as three types of larger structural variation (SV): insertions, deletions and large indels. In this workflow, your data should be from a haploid genome with at least one mate pair data set with read lengths of 100 bases or greater. Your total number of reads should be 10 million or fewer. If you use a larger data set, only the first 10 million reads will be used. For mate pair data, equal numbers of matching forward and reverse reads are processed. The SQD-formatted assembly can be edited at a later time using SeqMan Ultra or SeqMan Pro. When opened in either application, contigs will already be organized into scaffolds in the Explorer panel. This workflow replaces the “gap closure workflow” from Lasergene 16.0 and before. This newer version features an additional “refinement” stage before the “gap closure” stage and some additional “finishing” steps after the gap closure portion takes place. During assembly, data is processed in several stages: (1) Data is mapped and aligned to a user-defined set of consensus sequences from which a new consensus sequence is determined. Five rounds of this consensus refinement process are performed to remove the majority of single nucleotide and small multibase errors. (2) Data is mapped and aligned to the refined consensus sequence(s) from stage 1 and then analyzed for characteristic SV motifs. (3) The reference sequence is split at the detected SV sites, forming a series of ordered contigs. (4) Mate pair and split reads from each SV event are collected in site-specific pools and assembled de novo. Deletions are detected using three types of data: split reads, spanning paired-end reads, and sequence coverage information. For insertions and replacements, mate pair reads corresponding to the new sequence are collected from the unassembled read pool. Only reads anchored by mates flanking the SV in the main assembly are used at this stage. (5) The de novo assembled contigs are then brought into the main assembly and positioned consistently with the mate pair information. (6) For SVs where the gap is not completely covered by the de novo assembled contigs (e.g., insertions longer than twice the size of the insert library), additional reads from the unassembled read pool matching and extending the ends of the joining contigs are added in an attempt to “walk” across the gap. This walk is terminated when either no new reads are found or a repeated element is encountered. Note that the final de novo assembly performed in stage 6 typically results in additional contigs added to the final assembly project. These are often small contigs with redundant sequences of chromosomal segments. However, they can also represent plasmids, for example, that were not present in the input consensus sequences. Click here to see benchmarks for SeqMan NGen vs. three open source tools.
	Genome finishing – refinement	Align NGS data to a draft genome for further refinement of small errors and closing small gaps between contigs. This workflow is most frequently used for extending off the ends of saved contig consensus sequences and correcting small errors within the contigs. This type of assembly, which uses mate-pair data, can include up to 10 million reads and up to a 100 Mbase genome. It can be edited at a later time using a utility like SeqMan Ultra or SeqMan Pro.
	Combined reference-guided/de novo assembly	This workflow aligns paired-end NGS data from a new strain/isolate to a closely related reference genome (>90% identity) to replace SNVs and small indels as well as larger structural variants in the reference with the sequences of the new organism. This workflow is analogous to the Genome finishing – initial error correction workflow above and uses the same series of stages to construct the new sequence from the starting reference.
PacBio/Nanopore	De novo assembly (beta)	De novo assembly of long-read-only data sets with an option to first “correct” a genome-spanning set of overlapping reads prior to assembly. This workflow is designed to work with Oxford Nanopore and PacBio CLR & HiFi (AKA “CCS”) reads. This workflow typically produces more contigs than the standard single-stage de novo assembly, but consensus sequences are usually of higher base-level accuracy. The optional “correct first” mode of this workflow is initiated from the Preassembly Options screen by selecting the Run a first-pass correction assembly option. The “correct first” mode consists of two stages. First, the primary overlapping reads covering each contig from end to end are identified and combined with their overlapping and containment reads in a series of mini assemblies, the consensus sequences of which represent “corrected” sequence reads. Second, the corrected read sequences are de novo assembled into a final assembly from which new consensus sequences are determined. In the Post Assembly Options screen, you can optionally specify a reference sequence to use for ordering contigs into scaffolds. Note: If your genome is over 15 Mbase in length, you can only use this workflow if you use PacBio HiFi data and specify that read technology in the wizard.
	De novo assembly and polishing (beta)	De novo assembly of long-read-only data sets followed by NGS polishing to correct assembly errors. This workflow is useful for error-prone first-generation long-read data but is not necessary for PacBio HiFi data or newer-generation Oxford Nanopore data. Choosing this workflow will first de novo assemble a long-read data set and then automatically run the Genome finishing – initial error correction workflow (above, this table) starting from the de novo assembled consensus sequence(s).
	NGS polishing of draft genome (beta)	This workflow is also known as the “Illumina correction” workflow. This option takes an existing set of long-read assembled contig consensus sequences (AKA a “draft genome”) together with an NGS paired-end data set from the same organism and runs the Genome finishing – initial error correction workflow (above, this table). The draft genome is often from HGap, Canu or Unicycler, but can also be a genome that was “sloppily” assembled in the past from 454 data; the NGS paired-end data is typically Illumina data.
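
The size limits quoted in the table can be collected into a small pre-flight check to run before choosing a workflow. The following Python sketch is illustrative only and is not part of SeqMan NGen: the names `WorkflowLimits`, `LIMITS` and `check_dataset` and the dictionary keys are hypothetical, and the numbers are simply transcribed from the workflow descriptions above. Actual capacity also depends on available RAM, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WorkflowLimits:
    """Documented size limits for one workflow (values taken from the table)."""
    max_reads: int
    max_mbase: Optional[float] = None      # genome or combined contig length, if stated
    min_read_length: Optional[int] = None  # minimum read length in bases, if stated


# Hypothetical keys chosen for this sketch; they are not SeqMan NGen identifiers.
LIMITS = {
    "de_novo": WorkflowLimits(max_reads=30_000_000, max_mbase=50),
    "refinement": WorkflowLimits(max_reads=10_000_000, max_mbase=100),
    "initial_error_correction": WorkflowLimits(
        max_reads=10_000_000, min_read_length=100
    ),
}


def check_dataset(
    workflow: str,
    n_reads: int,
    genome_mbase: Optional[float] = None,
    read_length: Optional[int] = None,
) -> List[str]:
    """Return one warning per documented limit that the data set exceeds."""
    limits = LIMITS[workflow]
    warnings = []
    if n_reads > limits.max_reads:
        warnings.append(
            f"{n_reads:,} reads exceeds the documented maximum of "
            f"{limits.max_reads:,}"
        )
    if (
        limits.max_mbase is not None
        and genome_mbase is not None
        and genome_mbase > limits.max_mbase
    ):
        warnings.append(
            f"{genome_mbase} Mbase exceeds the documented maximum of "
            f"{limits.max_mbase} Mbase"
        )
    if (
        limits.min_read_length is not None
        and read_length is not None
        and read_length < limits.min_read_length
    ):
        warnings.append(
            f"read length {read_length} is below the documented minimum of "
            f"{limits.min_read_length} bases"
        )
    return warnings


if __name__ == "__main__":
    # Example: a 12-million-read mate pair set with 75-base reads.
    for message in check_dataset(
        "initial_error_correction", 12_000_000, read_length=75
    ):
        print(message)
```

An empty list means that no documented limit is exceeded. For the Genome finishing – initial error correction workflow, a read-count warning is informational rather than fatal, since only the first 10 million reads are used.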

