How to Assemble Genomes like a Bioinformatics Pro

Choosing the right assembly strategy for NGS and long read sequencing data in SeqMan NGen

Whether you are working with Illumina, Oxford Nanopore, PacBio data, or a combination of these, assembling a complete genome can be a challenge. Illumina and other NGS technologies produce highly accurate results, but the short reads can pose challenges in closing genomes. Newer long read sequencing technologies like those from Oxford Nanopore Technologies (ONT) and PacBio produce reads that can span all but the longest repeat elements in a genome. The lengths of these sequences make it possible to construct topologically correct consensus sequences for each chromosome and plasmid/organelle using de novo assembly.

However, because long read data are error-prone, the assembled sequences typically contain numerous large and small errors. The starting sequences generally have a base-level accuracy of 99%-99.5%, which is equal to one error every 100-200 bases. A typical 5Mb bacterial draft sequence is therefore expected to have 25,000-50,000 errors, which translates into each gene having an average of 2-10 errors. As most errors are small insertions or deletions, many genes will appear to contain frameshifts. A highly accurate sequence has many advantages, and if an accurately annotated genome is required, it is imperative that these errors be removed.
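The arithmetic behind these figures is easy to check. The short Python sketch below is an illustration only: the function name is hypothetical, and it assumes errors are spread evenly along the genome, which real data will only approximate.

```python
def expected_errors(genome_size_bp: int, accuracy: float) -> tuple[float, float]:
    """Return (expected number of errors, mean spacing between errors in bases).

    Assumes a uniform per-base error rate of (1 - accuracy).
    """
    error_rate = 1.0 - accuracy
    return genome_size_bp * error_rate, 1.0 / error_rate


# A 5 Mb draft at the two accuracy levels quoted in the text.
for accuracy in (0.99, 0.995):
    errors, spacing = expected_errors(5_000_000, accuracy)
    print(f"{accuracy:.1%} accurate: ~{errors:,.0f} errors, one every ~{spacing:.0f} bases")
```

Changing the genome size or accuracy shows how quickly the error burden grows for larger or noisier drafts.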

How can I overcome the limitations of long read sequencing data?

SeqMan NGen (part of Lasergene Genomics) features a de novo genome assembler for long read data, as well as a group of genome finishing workflows that use highly accurate NGS data to correct, or “polish,” the de novo assembled consensus sequences. The different genome finishing workflows allow you to select the one tailored to the type and extent of remaining errors in your sequences. Regardless of your starting point, SeqMan NGen has a workflow to get you closer to a complete error-free genome.

Which workflow should I select for my data?

SeqMan NGen’s “Workflow” wizard screen provides a number of de novo assembly options. This article will discuss the workflows boxed in red below. Note that these workflows are located in the “De Novo Assembly and Finishing” tab of the wizard screen.

The best way to determine your starting point is to consider which type(s) of data you have available. These data types may include one or more of the following:

– Long read sequencing data from PacBio or ONT

– NGS sequencing data from Illumina or Ion Torrent

– A draft assembly

– A closely-related reference genome

Below, we recommend specific options based on the type of data you are using. Note that you may need to try multiple options to see what produces the best result for your data.

Figure 1. SeqMan NGen’s setup wizard provides a number of choices for NGS and long-read de novo assembly.

I have only long read or only NGS data

Depending on the data type, use either the NGS-Based de novo assembly or PacBio/Nanopore de novo assembly workflows.

For PacBio/Nanopore data, assemblies can be done using either the “raw” sequence reads or “corrected” reads, which are a set of overlapping reads spanning each contig whose base accuracy has first been improved by alignment with other homologous reads in the data set.

The goal is to assemble reads into one topologically correct contig for each of the chromosomes and large plasmids/organelles. In practice, large chromosomes are often broken up into multiple contigs. In addition, small plasmids are often not represented in the data due to library preparation procedures.

Output files:

– An SQD project file that can be edited using SeqMan Ultra.

– A FASTA file of the consensus sequences that can be used as a starting point for downstream polishing/finishing workflows.
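Since the goal is one contig per chromosome or large plasmid/organelle, a quick look at the FASTA output shows how close an assembly came. The following is a minimal Python sketch, not a SeqMan NGen feature: the function name and the file name in the comment are hypothetical, and it assumes a standard FASTA file with ">" header lines.

```python
def fasta_lengths(lines):
    """Return {record name: sequence length} for FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Use the first word of the header as the record name.
            name = fields[0] if fields else f"record_{len(lengths) + 1}"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Tiny made-up example; with a real file, pass the open file handle instead:
#   with open("consensus.fasta") as handle:   # hypothetical file name
#       lengths = fasta_lengths(handle)
example = [">contig_1", "ACGTACGT", "ACGT", ">contig_2", "ACGTA"]
for name, length in sorted(fasta_lengths(example).items(), key=lambda item: -item[1]):
    print(f"{name}\t{length} bp")
```

A chromosome that was broken up during assembly shows up here as several contigs whose lengths add up to roughly the expected genome size.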

I have both long read and paired-end NGS data

Use the PacBio/Nanopore workflow De novo assembly and polishing. This workflow effectively combines two other workflows that you would otherwise need to do sequentially. In this workflow, you must use paired-end NGS data from the same isolate or sample that the long read data came from.

Behind the scenes, this single workflow first assembles the long read data using the algorithm from the PacBio/Nanopore workflow De novo assembly. Next, it polishes the consensus sequences using the algorithm from NGS polishing of a draft genome.

Output files:

– An editable SQD file of the de novo assembly.

– An SQD project file of the polished assembly for review and initial editing in SeqMan Ultra.

– FASTA files of both the de novo assembled and polished consensus sequences.

I have paired-end NGS data and an initial draft sequence from a de novo long read assembly in DNASTAR SeqMan NGen (or from a third-party assembler)

Perform an initial polishing of the consensus sequences using either the PacBio/Nanopore workflow NGS polishing of a draft genome or the NGS-Based workflow Genome finishing – initial error correction. While functionally the same, we present these workflows as separate options for those who may be more familiar with either the term “polishing” or “genome finishing” to refer to the correction of internal assembly errors in the contig consensus sequences.

Users with long read assemblies from HGAP, Canu or Unicycler may find the options in the NGS polishing of a draft genome workflow to be more familiar. The Genome finishing – initial error correction workflow can be used to polish NGS or long read assemblies with additional NGS data.

Both workflows serve as a first phase comprehensive finishing step by taking de novo assembled draft consensus sequences and using a series of automated steps with high accuracy NGS data to correct the mis-assemblies and misalignments within each sequence. Both workflows involve a two-step process:

1) Paired-end NGS data are used to correct internal errors (both small and large mis-assemblies/misalignments) in an existing set of contig consensus sequences from a draft assembly of long read data. Again, the NGS data should come from the same isolate or sample as was used to generate the initial draft assembly.

2) A final de novo assembly of unaligned NGS reads identifies pieces of the genome missed during initial assembly (e.g. small plasmids or gaps between sequences caused by low coverage in a long read data set).

Output files:

– An SQD project file of the polished assembly for review and editing in SeqMan Ultra.

– A FASTA file of the consensus sequences after polishing, but prior to any manual editing.

I have both paired-end NGS data and an initially polished set of consensus sequences

Correct any remaining errors and/or small gaps between contigs using Genome finishing – refinement.

This workflow is a rapid final finishing step that aligns high accuracy NGS data to polished consensus sequences. It allows for rapid cycles of editing and confirmation, ensuring that the final sequence(s) has no ambiguous bases and that there is uniform coverage of consistently placed paired-end data across each of the contigs.

The workflow begins with an existing set of polished contig consensus sequences and corrects any remaining internal errors in each contig using paired-end NGS data. It will also close any remaining small gaps between contigs using the pair information. There is also a setup option to extend and align read data off the ends of each contig which facilitates closing small gaps between sequences.

Output file:

– An SQD project file for review and editing in SeqMan Ultra.
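One of the goals named above, a final sequence with no ambiguous bases, is simple to verify once a consensus sequence has been saved as plain text. The Python sketch below is an illustration under that assumption; the function name is hypothetical and it works on any sequence string, independent of SeqMan NGen.

```python
from collections import Counter


def ambiguous_base_counts(sequence: str) -> Counter:
    """Count every character that is not an unambiguous A, C, G or T."""
    return Counter(base for base in sequence.upper() if base not in "ACGT")


# Made-up consensus fragment containing two N's and one R (A or G).
remaining = ambiguous_base_counts("ACGTNNACGRACGT")
print(dict(remaining) if remaining else "no ambiguous bases")
```

An empty result means every position has been resolved to a single base.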

I have both paired-end NGS data from a new strain/isolate and a closely related reference sequence

Construct the new sequence using the NGS-Based workflow Combined reference-guided/de novo assembly.

If you have NGS data, this workflow is an alternative method to de novo assembly for constructing a new genome. It uses an existing reference sequence from a closely related strain or isolate as the starting template. It then applies the same series of comprehensive steps as the PacBio/Nanopore workflow NGS polishing of a draft genome or the NGS-Based workflow Genome finishing – initial error correction to replace sequence differences from the reference with those of the new genome.

This workflow can also be used to discover and determine the sequence of plasmids/organelles specific to the new genome. The consensus sequences from this workflow are also suitable for final finishing using the NGS-Based workflow Genome finishing – refinement.

Output file:

– An SQD project file for review and editing in SeqMan Ultra.
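The recommendations in the sections above can be condensed into a small lookup. The Python sketch below is only an illustration: the function name, its parameters and the order in which the conditions are checked are assumptions made for this example, and SeqMan NGen itself is driven through its wizard rather than through code.

```python
def suggest_workflows(long_reads=False, ngs=False, paired_end=False,
                      draft=False, polished_draft=False, reference=False):
    """Return the workflow name(s) this article recommends for the given inputs."""
    has_pairs = ngs and paired_end
    # Most advanced starting point first (this ordering is an assumption).
    if has_pairs and polished_draft:
        return ["NGS-Based: Genome finishing – refinement"]
    if has_pairs and draft:
        # The article describes these two workflows as functionally the same.
        return ["PacBio/Nanopore: NGS polishing of a draft genome",
                "NGS-Based: Genome finishing – initial error correction"]
    if has_pairs and long_reads:
        return ["PacBio/Nanopore: De novo assembly and polishing"]
    if has_pairs and reference:
        return ["NGS-Based: Combined reference-guided/de novo assembly"]
    if long_reads:
        return ["PacBio/Nanopore: De novo assembly"]
    if ngs:
        return ["NGS-Based: De novo assembly"]
    return []  # no sequencing data described


print(suggest_workflows(long_reads=True, ngs=True, paired_end=True))
```

As noted earlier, you may still need to try more than one option to see what produces the best result for your data.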

Why is genome finishing necessary?

As mentioned in the introduction, removing assembly errors is important for many downstream applications including producing an accurately annotated genome. However, it is now evident that there are several different ways to do this. To simplify your decision, we created a handy flowchart.

Figure 2. Flowchart for choosing a de novo assembly workflow in SeqMan NGen.

As shown in the flowchart, the NGS polishing of a draft genome and Genome finishing – initial error correction workflows are best used as the first finishing step, as they offer the most comprehensive removal of small and large scale errors as well as assembling pieces of the genome missed in the initial assembly. The resulting corrected consensus sequences commonly have a base-level accuracy of >99.99%, meaning that our typical 5Mb bacterial genome will now have only ~500 remaining errors, or one every 10kb on average. While this level of accuracy is sufficient for provisional annotation, it still leaves roughly one in ten genes with at least one erroneous base in their sequence.
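The "one in ten genes" estimate can be reproduced with a short calculation. This Python sketch is an illustration under two assumptions that are not stated in the article: a gene length of about 1,000 bases, and errors that occur independently at every position. The function name is hypothetical.

```python
def fraction_of_genes_with_error(accuracy: float, gene_length_bp: int) -> float:
    """Probability that a gene of the given length contains at least one error.

    A gene is error-free only if every one of its bases is correct,
    which happens with probability accuracy ** gene_length_bp.
    """
    return 1.0 - accuracy ** gene_length_bp


# 99.99% base accuracy and an assumed gene length of 1,000 bp.
fraction = fraction_of_genes_with_error(0.9999, 1_000)
print(f"{fraction:.1%} of genes are expected to carry at least one error")
```

Longer genes are proportionally more likely to be affected, which is one reason a further refinement round is worthwhile before final annotation.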

Applying a round of the Genome finishing – refinement workflow to the polished sequences removes most of the remaining errors, although manual inspection and editing in SeqMan Ultra are usually required. The rapid nature of this workflow makes it possible to do as many rounds as needed to refine and ultimately confirm the sequence. The number of rounds of polishing and refinement depends upon the quality of each intermediate assembly. Initial de novo assemblies typically have many more errors, large and small, that require the more comprehensive correction offered by the Genome finishing – initial error correction or NGS polishing of a draft genome workflow. The final sequence should be of very high quality, containing few or no errors, and ready for annotation.

Figure 3. Benchmark data for contigs and scaffolds created using three different SeqMan NGen workflows.

How do I know when I have done enough rounds of polishing or refinement?

After the initial correction, you can look for likely errors by inspecting the assemblies in SeqMan Ultra. Here are just a few examples:

– Circularly redundant ends can be removed.

– Small plasmids that were not assembled with the long reads can often be identified by looking for contigs with a high depth of coverage.

– A sizable piece missed in the initial de novo assembly may be present as a new Illumina-based contig and can be identified using pair information.
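The depth-of-coverage check for small plasmids can also be done outside the viewer if you have contig lengths and mean depths in hand. The Python sketch below is a rough illustration: the function name, the input layout and the two-fold threshold are all assumptions chosen for this example, and it treats the longest contig as the chromosome baseline.

```python
def flag_high_depth_contigs(contigs, min_ratio=2.0):
    """Return names of contigs whose mean depth is at least min_ratio times
    the depth of the longest contig (assumed here to be the chromosome).

    contigs maps a contig name to a (length_bp, mean_depth) tuple.
    """
    if not contigs:
        return []
    baseline = max(contigs.values(), key=lambda item: item[0])[1]
    if baseline <= 0:
        return []
    return [name for name, (_length, depth) in contigs.items()
            if depth >= min_ratio * baseline]


# Made-up numbers: one chromosome-sized contig and two smaller ones.
example = {
    "contig_1": (4_800_000, 60.0),
    "contig_2": (110_000, 75.0),
    "contig_3": (4_200, 640.0),
}
print(flag_high_depth_contigs(example))
```

Flagged contigs are only candidates; repeats can also show elevated depth, so each one still deserves a look in SeqMan Ultra.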

Figure 4. SeqMan Ultra can be used to group contigs from a paired-end data assembly into scaffolds.

When you have a high quality initial assembly, though, it’s often just a matter of saving the consensus sequences of the chromosome(s) and organelles/plasmids and running them through the refinement workflow one time.

With all the assembly tools available today, why choose SeqMan NGen for long read or NGS assembly?

We know that when it comes to sequence assembly, you have many options – from homegrown solutions to open-source tools to commercial software packages. Here are three reasons you should strongly consider SeqMan NGen for long read or NGS assembly:

Integrated Solution – Lasergene Genomics enables you to perform all your sequence assembly and analysis in one comprehensive package. Start with one or more rounds of custom assembly using our intuitive SeqMan NGen wizard. For downstream analysis and editing, open the resulting files in graphically-rich SeqMan Ultra, newly released in 2020. Lasergene Genomics runs on suitably equipped standard desktop computers or on the DNASTAR Cloud.

Flexibility – Only a few labs have dedicated bioinformatics staff available to create assemblies and customize refinement steps. If you’re one of the majority who does not have this option, SeqMan NGen provides a comprehensive set of workflows that can produce a completed genome starting at various stages of assembly and finishing. The software supports long read, NGS and Sanger data.

Accuracy – SeqMan NGen was developed by scientists who had to assemble and annotate genomes with the utmost accuracy. Our genome polishing workflow has been proven to correct long-read assemblies better than the alternatives, producing more accurate genomes with a larger percentage of the genome captured after polishing. By using one or more of these new SeqMan NGen workflows, you can obtain perfect or near-perfect sequences with minimal manual intervention. DNASTAR’s development team is continuously refining the software and adding new options and features to meet ever-evolving needs.

And here’s a bonus “fourth reason”: If you don’t want to go through the steps of assembly and polishing, you can also use DNASTAR services to have us perform assembly and annotation for you!

Ready to try de novo genome assembly for yourself? Download a 14-day free trial or visit our workflow page to learn more.
