Choosing Between the Classic and Pro Assemblers

SeqMan Pro provides two sequence assembly methods: Classic and Pro (Allex 1999).

The following are general guidelines for when each assembly method should be used.

Note: If you are assembling next-gen data or very large data sets, we recommend using SeqMan NGen rather than either type of SeqMan Pro assembler.

Use Pro Assembler when:

• Your data set is medium in size. (Large data sets should be assembled with SeqMan NGen instead.) Assembly takes far less time with the Pro Assembler than with the Classic Assembler.

• You have repeated sequences in your data set. The Use Repeat Handling function and the Match Percentage calculation help prevent repeated regions from assembling together incorrectly.

• You have data with noisy ends. Use the Maximum Mismatch End Bases parameter to ignore the noisy ends.

• You are doing variant analysis. Gapped regions are aligned more accurately.

Use Classic Assembler when:

• You want to reproduce an assembly you did with an earlier version of SeqMan Pro.

• You have no repeated sequences in your data set and your data set is very small.

• You do not use vector trimming. The global Match Percentage calculation will allow sequences with small vector regions to assemble together.

Note: You may select either assembly method as your default. See Setting Default Parameters.

Other considerations:

• Assembly Time: The Pro Assembler assembles sequences substantially more quickly than the Classic Assembler. As the number of sequences increases, the time to assemble increases linearly with Pro as opposed to a quadratic increase with Classic. For example, with approximately 1000 sequences, it can take Classic four times as long as Pro to assemble.

• Memory Use: With medium-sized data sets (large data sets should be assembled with SeqMan NGen), the Pro Assembler assembles sequences with much less memory than the Classic Assembler. As with assembly time, as the number of sequences increases, the memory required increases linearly with Pro as opposed to a quadratic increase with Classic.

• Match Percentage: The Match Percentage calculation in the Pro Assembler is more precise than the calculation in the Classic Assembler. The Pro Assembler uses a local Match Percentage, which requires that the Match Percentage threshold be met in each overlapping window of 50 bases. The Classic method uses a global Match Percentage that simply requires that the Match Percentage threshold be met over the entire alignment.

• For data sets that contain repeated regions, and that have been trimmed for vector, the Pro assembly method is likely to produce a more accurate assembly with fewer false joins. For data sets that do not include repeated regions or have not been trimmed for vector, the global calculation used in the Classic assembly method is likely to produce fewer, larger contigs than the Pro method.

An example containing a repeated region follows.

A fragment of a genome has a repeated region, labeled A and A’, and two unique regions, labeled B and C.

When the fragment is sequenced, one of the sequences contains parts of regions A and B, and another contains parts of regions A’ and C:

In this example, the default Match Percentage threshold of 80% is used for both the Pro and Classic assembly methods. When the two sequences are aligned, the 400 bases in the overlapping A and A’ regions match 100%. The 200 bases in the overlapping B and C regions match 42%, or 84 bases. Over the entire alignment, 484 bases out of 600 match, yielding a global Match Percentage of 81%. This exceeds the Classic assembly method threshold of 80%, so the sequences are incorrectly assembled into a contig.

In the Pro assembly method example, the Match Percentage is checked for every window of 50 aligned bases. The alignment below shows the last 36 overlapping bases of A and A’ and the first 18 overlapping bases of B and C. Each mismatch in the overlap is marked by an X below the alignment. In the first 50 bases shown, there are 41 matches, and the Match Percentage is 82%. This is above the threshold of 80%, so the Match Percentage of the next 50 bases is checked and is also found to be 82%. Each window of 50 bases is checked along the overlap for as long as the Match Percentage stays at or above the threshold. In this case, the alignment fails once the window extends far enough into the overlap of the unique regions, B and C, that the Match Percentage drops to 78%. The sequences are not assembled together into a contig, which is correct for this data set.
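The difference between the two calculations can be sketched in a few lines of Python. This is only an illustration of the idea, not SeqMan Pro's implementation: the function names, the representation of an alignment as a list of match/mismatch flags, and the one-base window step are all assumptions, and the mismatch pattern in the B/C region is synthetic (chosen only so that 42% of those 200 bases match).

```python
from typing import List


def match_percentage(matches: List[bool]) -> float:
    """Percentage of aligned positions that match (global calculation)."""
    return 100.0 * sum(matches) / len(matches)


def passes_global(matches: List[bool], threshold: float = 80.0) -> bool:
    """Classic-style check: the threshold must hold over the whole overlap."""
    return match_percentage(matches) >= threshold


def passes_local(matches: List[bool], threshold: float = 80.0,
                 window: int = 50, step: int = 1) -> bool:
    """Pro-style check: the threshold must hold in every 50-base window.

    The step between windows is an assumption made for this sketch.
    """
    if len(matches) <= window:
        return passes_global(matches, threshold)
    for start in range(0, len(matches) - window + 1, step):
        if match_percentage(matches[start:start + window]) < threshold:
            return False  # one failing window rejects the overlap
    return True


# Synthetic overlap: 400 matching bases (A vs A'), then 200 bases (B vs C)
# in which 21 of every 50 positions match, i.e. 42%.
overlap = [True] * 400 + [(i % 50) < 21 for i in range(200)]

print(round(match_percentage(overlap)))  # global Match Percentage, rounded
print(passes_global(overlap))            # global check accepts the join
print(passes_local(overlap))             # windowed check rejects it
```

With this synthetic data the global check accepts the overlap while the windowed check rejects it, because windows that extend into the B/C region fall below the threshold even though the overall percentage does not.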

• Repeated Data: The Pro Assembler provides special handling of repeated data and produces fewer mis-assemblies of repeated regions. The Use Repeated Handling option allows the user to select this function and to set its parameters.

• Noisy Data: Noisy data can be trimmed from the ends of sequences using the Trim sequence ends preassembly option. You may, however, prefer a low trimming stringency that leaves some noisy data on the ends of your sequences. For those cases, the Pro Assembler provides a parameter, Maximum Mismatch End Bases, that allows the algorithm to ignore noisy ends of the sequences when checking the Match Percentage. The value entered for Maximum Mismatch End Bases is the number of bases from either end in which mismatches are not counted in the Match Percentage calculation.

• Alignment: The Pro assembly method uses the ReAligner alignment algorithm (Anson & Myers 1997). Classic uses a combination of the Martinez and Needleman-Wunsch algorithms (Martinez 1983, Needleman & Wunsch 1970). The Classic assembly method is prone to occasional misalignments in gapped regions, which can lead to incorrect identification of putative variants when doing variant analysis.
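The effect of the Maximum Mismatch End Bases parameter described under Noisy Data can be illustrated with a short Python sketch. It is a simplified model rather than the Pro Assembler's actual calculation: the function name is hypothetical, the alignment is represented as a list of match/mismatch flags, and it assumes that "not counted" means mismatches within the end zones are left out of the calculation entirely.

```python
from typing import List


def match_percentage_ignoring_ends(matches: List[bool],
                                   max_mismatch_end_bases: int) -> float:
    """Match percentage in which mismatches near either end are skipped.

    Assumption for this sketch: a mismatch within max_mismatch_end_bases
    of either end is dropped from both the numerator and the denominator.
    """
    n = len(matches)
    counted = 0
    matched = 0
    for i, is_match in enumerate(matches):
        in_end_zone = (i < max_mismatch_end_bases
                       or i >= n - max_mismatch_end_bases)
        if in_end_zone and not is_match:
            continue  # noisy-end mismatch: not counted
        counted += 1
        matched += int(is_match)
    return 100.0 * matched / counted if counted else 0.0


# Synthetic 100-base overlap: 10 noisy bases at each end, plus
# 4 genuine mismatches in the interior.
overlap = [False] * 10 + [True] * 76 + [False] * 4 + [False] * 10

print(match_percentage_ignoring_ends(overlap, 0))   # ends counted
print(match_percentage_ignoring_ends(overlap, 10))  # 10 end bases ignored
```

In this synthetic case the overlap falls below an 80% threshold when the noisy ends are counted and clears it when 10 bases at each end are ignored, while the interior mismatches still count against the result.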