Pairwise Alignment with MegAlign Pro: Choosing the best alignment for your project
Written by Eric Cabot, PhD
With the latest version of Lasergene, MegAlign Pro, part of the Molecular Biology Suite, supports pairwise sequence alignments. In addition to the five multiple alignment engines that were already available, users can now perform local, global and semi-global pairwise alignments using the industry-standard Smith-Waterman and Needleman-Wunsch algorithms. This article highlights some of these new capabilities and explores situations where aligning sequences in pairs is more appropriate than aligning many sequences at once. At the end of the article, we’ve also provided some example projects so you can experiment with the different alignment methods.
When should I use a pairwise alignment?
The answer to this question may seem obvious: use pairwise alignment when you are only interested in two sequences. Sometimes, though, pairwise alignment is simply more suitable than multiple alignment even when you have many sequences, and we’ll look at some examples later on. Additionally, there are situations where a multiple sequence alignment (MSA) might help identify pairs of sequences or subsequences that are worth a more detailed, pairwise comparison.
Beyond workflow considerations, there are some fundamental differences between the two categories of alignment that might make pairwise alignment a better option for some sequence comparisons. Due to the nature of progressive multiple aligners (including Clustal, MUSCLE and MAFFT), the final sequence alignment can contain inappropriately placed gaps, which adversely affect the interpretation of the results. To understand this, we need to take a closer look at how progressive multiple alignments work. The process invariably begins with a single pairwise alignment, adding gaps as necessary in order to minimize the number of mismatches. As the aligner proceeds, additional gaps are added as single sequences and groups of sequences aligned during an earlier stage of the process are included in the growing multiple sequence alignment. During this phase, gaps may be added but are never removed.
This “once a gap, always a gap” approach is a potential drawback that is shared by all progressive multiple alignment algorithms. The heart of the problem is that gap placement (and therefore the alignment) might be affected by the order in which sequences are aligned to each other, because sequences added later in the process might be incorrectly aligned. All of the multiple alignment engines used by MegAlign Pro use a “guide tree” based on pairwise similarities of sequences to determine the order in which to align sequences. The first pair chosen consists of the two sequences that are least distant on the guide tree. If the nearest neighbor of this pair is more distant from it than the members of some other pair are from each other, that other pair is aligned next. If not, the neighbor is aligned with the first pair and gaps are added as necessary. In later rounds there may be no singleton sequences left, just clusters of two or more sequences that have already been aligned. Imagine a case where one of a group of early aligned sequences should have been added later, or where a close relative was added too late. It’s hard to know when this has happened unless you have some a priori information, such as knowledge of the evolutionary relationships among your sequences.
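To illustrate how a guide tree can dictate the order of alignment, here is a minimal sketch in Python. It is a generic average-linkage clustering, not the procedure used by any particular engine in MegAlign Pro, and the function name join_order, the sequence names and the distances are all made up for illustration.

```python
from itertools import combinations


def join_order(names, dist):
    """Return the order in which clusters would be merged (aligned).

    dist maps frozenset({name1, name2}) to a pairwise distance.
    Average linkage is an assumption made for this sketch.
    """
    clusters = [(name,) for name in names]
    order = []

    def average_distance(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Pick the two clusters that are closest to each other.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        order.append((c1, c2))
    return order


# Hypothetical distances: A and B are close, C and D are close,
# and the two pairs are far from each other.
toy = {
    frozenset("AB"): 1, frozenset("CD"): 2,
    frozenset("AC"): 6, frozenset("AD"): 6,
    frozenset("BC"): 6, frozenset("BD"): 6,
}
for step in join_order(["A", "B", "C", "D"], toy):
    print(step)
```

In this toy case A and B are aligned first, then C and D are aligned to each other (because they are closer to each other than either is to the A/B pair), and only then are the two clusters merged. Any gaps placed in the first two steps are carried into the final step.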
The bottom line is that when you examine just a pair from a multiple sequence alignment you may not see the same results as a pairwise alignment of just the two. So the direct approach under these circumstances might give a better picture of the relatedness of the pair.
Which is better, multiple or pairwise alignment?
This question is difficult to answer because it very much depends on how the alignment is going to be used.
Mechanistically, the best sequence alignment is the one that produces the fewest mismatches. That metric can be misleading, especially if minimizing mismatches entails extreme amounts of gapping. Consider an example where the goal is to identify some particular conserved domains or a large insertion. Here the placement of gaps outside of the regions of interest may well be of limited concern. Note that with MegAlign Pro, you can select the interesting regions identified by a multiple alignment and copy them as subsequences to a new document for further analysis.
Now consider a situation where a multiple sequence alignment is used to represent the actual relatedness of a group of sequences. Here the alignment is essentially a model, typically of an evolutionary process. In this case the “best” alignment is the one that is most plausible in light of some biological theory or model. One way to visualize this, of course, is to use the alignment to make an evolutionary tree.
Now we are ready to take a more detailed look at some examples of alignment workflows.
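As a quick illustration of why counting mismatches alone can mislead, the following Python sketch tallies mismatches and gapped positions for two ways of aligning the same pair of toy sequences. The sequences and the helper name mismatches_and_gaps are invented for this example.

```python
def mismatches_and_gaps(row_a, row_b):
    """Count mismatched columns and gapped columns in two aligned rows."""
    mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gaps += 1
        elif a != b:
            mismatches += 1
    return mismatches, gaps


# The same two toy sequences, aligned without gaps and with extreme gapping.
print(mismatches_and_gaps("ACGT", "TGCA"))          # (4, 0)
print(mismatches_and_gaps("ACGT----", "----TGCA"))  # (0, 8)
```

The second alignment has no mismatches at all, yet it tells us nothing about how the two sequences are related, which is why gap penalties are part of any sensible scoring scheme.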
Three types of pairwise alignment: Local, Global and Semi-Global
The three types of alignment are actually quite similar, although they can often produce very different results. All use a method called dynamic programming to find the best-scoring alignment between two sequences. Alignment scores are computed by adding up the match scores for each aligned position, then subtracting one penalty for opening each gap (of any length) and another for each position that a gap covers. The match scores are based on a scoring matrix such as NUC42 or BLOSUM62.
Tip: It’s always a good idea to explore the effect of various settings of these three parameters (the scoring matrix, the gap-opening penalty and the gap-extension penalty) to see if you can get a more desirable outcome.
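The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration: a simple match/mismatch score stands in for a real scoring matrix, and the default penalty values are arbitrary choices for this example rather than MegAlign Pro defaults.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1,
                    gap_open=5, gap_extend=1):
    """Score two equal-length gapped rows.

    Each run of gaps costs gap_open once, plus gap_extend for every
    gapped position in the run (including the first).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap_a = in_gap_b = False  # currently inside a gap run in row a / b?
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-":
            if not in_gap_a:
                score -= gap_open
            score -= gap_extend
            in_gap_a, in_gap_b = True, False
        elif b == "-":
            if not in_gap_b:
                score -= gap_open
            score -= gap_extend
            in_gap_a, in_gap_b = False, True
        else:
            score += match if a == b else mismatch
            in_gap_a = in_gap_b = False
    return score


print(alignment_score("ACGTAC", "AC--AC"))  # one gap of length 2: -3
print(alignment_score("ACGTAC", "A-G-AC"))  # two gaps of length 1: -8
```

Both toy alignments have four matches and two gapped positions, but the one with a single longer gap scores higher because the opening penalty is paid only once. Changing gap_open and gap_extend shifts that balance, which is the effect the tip above encourages you to explore.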
Depending on your two sequences, the three methods can potentially yield widely different results, so it’s important to understand how they differ.
Local Pairwise Alignment
MegAlign Pro’s local alignment algorithm, a modernized variant of the one described by Smith and Waterman (1981), is designed specifically to find the highest-scoring aligned segments of two sequences, even if the full extent of the two is not included in the final alignment. (Note: in MegAlign Pro, the “Show Context” check-box in the Style Panel lets you display any unaligned parts of the sequences flanking the aligned segments.)
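For readers who want to see the idea in code, here is a minimal textbook-style sketch of local alignment in Python. It is not MegAlign Pro’s modernized variant: it assumes a simple match/mismatch score and a single linear gap penalty, and the parameter values and toy sequences are made up.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Scores never drop below zero, so an alignment can start anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTTACGTACGTCCCC", "GGACGTACGTAA"))
```

With these toy sequences, only the shared ACGTACGT segment is reported. The flanking T, C, G and A runs are left out of the alignment, which is the kind of unaligned context mentioned in the note above.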
Global Pairwise Alignment
The alternative to aligning locally is to align globally. To do this, MegAlign Pro uses two variants of the Needleman and Wunsch (1970) algorithm. Global aligners don’t try to find the best-scoring segment, but instead require that the full extent of both sequences be included in their results. There is no requirement or guarantee that the best-scoring pair of aligned segments from a local alignment will be aligned to each other in a global alignment.
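A minimal sketch of global alignment in Python follows. It is the basic textbook form with a single linear gap penalty, not either of the two variants used by MegAlign Pro, and the scoring values and toy sequences are assumptions made for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalized: both sequences must be aligned end to end.
    for i in range(1, rows):
        F[i][0] = -gap * i
    for j in range(1, cols):
        F[0][j] = -gap * j
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] - gap, F[i][j - 1] - gap)
    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


score, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(row_a)
print(row_b)
```

Unlike the local approach, every residue of both sequences appears in the output, so any length difference has to be absorbed by gaps. When several alignments tie for the best score, which one is returned depends on the order of the checks in the traceback.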
Semi-Global Pairwise Alignment
Semi-global alignment is a relatively new approach that is particularly suitable when the two sequences differ greatly in length. When that happens, the longer sequence will have overhangs at one or both ends of the alignment. Since overhangs are represented with gaps, a global aligner will attempt to increase the match score and minimize accumulated gap penalties by aligning parts of the shorter sequence to the overhanging region(s) of the longer sequence. This effect can produce a number of unrealistic, usually small, aligned segments separated by gaps near the ends of the alignment. Semi-global alignment is designed to address this problem by not penalizing gaps in overhangs (also known as “end gaps”).
The choice among these three pairwise approaches can have a substantial impact on the resulting alignment, and which one to use depends on your task. For basic cases, such as aligning two genes or proteins, local alignment is a good starting point, but when things get more complicated, global or semi-global may be the way to go. Let’s look at some examples to better understand the differences between these methods.
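The effect of free end gaps can be sketched in Python with a single scoring function that switches between global and semi-global behavior. This is a simplified illustration with a linear gap penalty. The function name pairwise_score, its free_end_gaps flag, the scoring values and the toy sequences are all assumptions for this example, not MegAlign Pro settings.

```python
def pairwise_score(a, b, free_end_gaps, match=1, mismatch=-1, gap=1):
    """Best alignment score, with or without penalties for end gaps."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    if not free_end_gaps:
        # Global: leading gaps cost the normal gap penalty.
        for i in range(1, rows):
            F[i][0] = -gap * i
        for j in range(1, cols):
            F[0][j] = -gap * j
    # Semi-global: the first row and column stay at zero (free leading gaps).
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] - gap, F[i][j - 1] - gap)
    if free_end_gaps:
        # Free trailing gaps: the alignment may end anywhere on the
        # last row or last column.
        last_col = max(F[i][cols - 1] for i in range(rows))
        return max(last_col, max(F[rows - 1]))
    return F[rows - 1][cols - 1]


long_seq = "GGGGACGTACGGGG"
short_seq = "ACGTAC"
print(pairwise_score(long_seq, short_seq, free_end_gaps=False))  # -2
print(pairwise_score(long_seq, short_seq, free_end_gaps=True))   # 6
```

The short sequence matches the middle of the long one perfectly in both cases, but the global score is dragged down by the eight overhanging positions. With larger, more realistic inputs, that pressure is what can lead a global aligner to scatter small segments of the shorter sequence across the overhangs.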
Download our step-by-step tutorials and demo data to see how these different alignment methods compare, and when to use each one, using the examples that follow.
Example 1: Going from a multiple alignment to interesting pairwise comparisons
Tutorial #1 (“Multiple versus pairwise alignments”) shows an example where a pairwise alignment can help resolve a confusing placement of gaps within a multiple alignment. In this case, a multiple protein sequence alignment suggests that the protein sequence from a specific organism (Tupaia chinensis, the Chinese tree shrew) is severely truncated at its C-terminus and ends with a run of 27 residues that seem to be unrelated to the other members of the alignment, including a relative (Sorex araneus, the Eurasian shrew). A pairwise global alignment of the two shrew sequences, however, reveals that a more likely interpretation is that the T. chinensis sequence contains a deletion of 235 residues followed by a terminal stretch of 32 amino acids that is nearly identical to that of the S. araneus sequence. Here the pairwise alignment suggests that the first-pass multiple sequence alignment is not optimal. Armed with this information, you can try changing the alignment engine and gap penalties to see if a more reasonable result can be achieved. Another technique that might help is to use sub-alignments to refine the overall multiple sequence alignment.
This example also demonstrates the power of using pairwise and multiple alignments together to help interpret specific relationships between sequences that might have become obscured by gaps that were added during the multiple alignment process. With MegAlign Pro, it is simple to generate many pairwise alignments without ever having to disturb the multiple sequence alignment, which provides the larger picture. This is far more convenient than starting over with several different documents.
Example 2: Aligning transcripts to genes
Tutorial #2 (“Aligning transcripts to genes”) illustrates the utility of pairwise alignments when comparing mRNA transcripts to their cognate genes. This example begins with a multiple alignment of the alcohol dehydrogenase (ADH) gene from Drosophila melanogaster and four mRNA transcript isoforms. Comparing the overview of the alignment to the gene’s annotations reveals a few problems that are easily resolved with pairwise alignments.
As with the previous example, a global alignment is the most suitable approach, because long gaps corresponding to introns may make it hard for a local aligner to join segments from exons. With a local alignment, the ends of a transcript may also be left out of the aligned segments.