Understanding Mer Tags

The Pro Assembler layout algorithm relies on unique subsequences of bases, or mers, that occur in overlapping regions of fragment reads. Mers that are common to two or more fragment reads are aligned to determine the overall layout of reads. Overlapping reads have many mers in common, but only a few mers per overlapping region are needed to identify the overlap. These mers are called mer tags. The use of mers to tag fragments and identify overlaps is illustrated in the following figure:

 

[Figure: mer tag figure]

 

Note: In the figure above, a 54 bp original DNA sequence is covered by five overlapping fragment reads. The 6-mer tags for each fragment read are underlined, and matching mer tags are aligned to determine the layout of the reads.
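The idea of aligning reads by their shared mers can be sketched in a few lines of Python. This is only an illustration and not the Pro Assembler implementation: the function names (mers, shared_mers, overlap_shifts) and the two example reads are hypothetical, and every mer in a read is indexed here, whereas the assembler keeps only a few mer tags per overlap.

```python
from collections import defaultdict

def mers(read, k):
    """Yield (offset, mer) for every k-length subsequence of a read."""
    for i in range(len(read) - k + 1):
        yield i, read[i:i + k]

def shared_mers(reads, k):
    """Map each mer to its (read_index, offset) hits, keeping only mers
    that occur in two or more different reads."""
    index = defaultdict(list)
    for r, read in enumerate(reads):
        for offset, mer in mers(read, k):
            index[mer].append((r, offset))
    return {m: hits for m, hits in index.items()
            if len({r for r, _ in hits}) >= 2}

def overlap_shifts(reads, k):
    """For each pair of reads (a, b) with a < b, collect the start position
    of read b relative to read a implied by each shared mer."""
    shifts = defaultdict(set)
    for hits in shared_mers(reads, k).values():
        for ra, oa in hits:
            for rb, ob in hits:
                if ra < rb:
                    shifts[(ra, rb)].add(oa - ob)
    return dict(shifts)

# Two hypothetical reads taken from the 20 bp sequence ACGTTGCATGCCGATAGCTA.
reads = ["ACGTTGCATGCC", "CATGCCGATAGCTA"]
print(shared_mers(reads, 6))     # {'CATGCC': [(0, 6), (1, 0)]}
print(overlap_shifts(reads, 6))  # {(0, 1): {6}}
```

In this example a single shared 6-mer is enough to place the second read six bases after the start of the first, which is why only a few mers per overlapping region are needed.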

 

The power of mer tags depends on the ability of the Pro Assembler to choose mers that are most likely to occur only once in the original DNA sequence. It is important to avoid choosing mers that occur in repeated regions, because such mers can cause fragment reads to be aligned together incorrectly.

 

Three assembly parameters are involved in choosing mer tags: Match Size, Use Repeat Handling, and Match Spacing.

 

The settings of the Match Size and Use Repeat Handling parameters help to choose tags that are most likely to be unique in the original DNA sequence. The Match Size parameter sets the length of the mers. The longer the mer, the more likely it is to be unique. The Use Repeat Handling parameter helps to identify mers that are not likely to be unique. If a mer occurs more often than expected in the dataset, the mer may be part of a repeated region.
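The following Python sketch shows one way that mers occurring "more often than expected" could be flagged. It is an assumption made for illustration, not the documented behavior of Use Repeat Handling: the function names and the expected_count and factor arguments are hypothetical, and the threshold rule (more than factor times the expected count) is only one plausible choice.

```python
from collections import Counter

def count_mers(reads, match_size):
    """Count how often each mer of length match_size occurs in the dataset."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - match_size + 1):
            counts[read[i:i + match_size]] += 1
    return counts

def likely_repeats(counts, expected_count, factor=2.0):
    """Return the mers seen more than factor * expected_count times.
    Both arguments are hypothetical tuning values for this sketch."""
    return {mer for mer, n in counts.items() if n > factor * expected_count}

# Hypothetical dataset: the first read contains a short tandem repeat.
reads = ["ACGTACGTACGT", "TTGACCA"]
counts = count_mers(reads, match_size=4)
print(counts["ACGT"])                            # 3
print(likely_repeats(counts, expected_count=1))  # {'ACGT'}
```

The effect of Match Size on uniqueness follows from simple counting: there are 4**k possible mers of length k, so 4**6 = 4096 possible 6-mers but 4**12 = 16777216 possible 12-mers, and a longer mer is correspondingly less likely to occur twice by chance.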

 

The Match Spacing parameter specifies the preferred distance between mer tags. The smaller the Match Spacing, the more mer tags are chosen per read, and the more memory and time the assembly requires. If a fragment read is shorter than the Match Spacing, multiple mer tags are still chosen for the read.
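The relationship between Match Spacing and the number of tags per read can be sketched as follows. This is a simplified model under stated assumptions, not the Pro Assembler's selection rule: tags are placed evenly rather than chosen for uniqueness, and the tag_positions function and its min_tags argument (the minimum number of tags kept for a short read) are hypothetical.

```python
def tag_positions(read_length, match_size, match_spacing, min_tags=2):
    """Return evenly spread start offsets for mer tags in one read.

    Assumption for this sketch: a read always gets at least min_tags tags
    (when it is long enough to hold them), even if it is shorter than
    match_spacing.
    """
    last_start = read_length - match_size  # last offset where a mer fits
    if last_start < 0:
        return []                          # read is shorter than one mer
    n_tags = max(min_tags, last_start // match_spacing + 1)
    n_tags = min(n_tags, last_start + 1)   # no more tags than distinct offsets
    if n_tags == 1:
        return [0]
    # Integer arithmetic keeps the offsets distinct and increasing.
    return [(i * last_start) // (n_tags - 1) for i in range(n_tags)]

print(tag_positions(100, 10, 30))       # [0, 30, 60, 90]
print(tag_positions(20, 10, 30))        # [0, 10]
print(len(tag_positions(100, 10, 10)))  # 10
```

In this model, reducing the spacing from 30 to 10 raises the number of tags on a 100-base read from 4 to 10, which illustrates why a smaller Match Spacing increases memory use and run time. The second call shows a read shorter than the spacing still receiving two tags.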