Consider the case of a researcher who is trying to investigate the role of a gene isolated from a yet-unsequenced Salmonella strain. This strain has already been demonstrated to be both copper and multi-drug resistant. As demonstrated in the following tutorial, one way to proceed is to align the sequence of the uncharacterized gene to the genome of a related reference strain. If a reasonable alignment is found, the annotations of the matching segment can then be examined to infer the function of the cloned gene.
- If you have not done so already, download the pairwise tutorial data and extract it to any convenient location (e.g., your computer desktop).
- Double-click on the Salmonella_CT18_plus_gene.msa project file to launch it in MegAlign Pro.
The document contains just two unaligned sequences: the genomic sequence of reference strain S. enterica serovar Typhi, strain CT18, and the unknown sequence.
- Select both sequences, right-click on the selection, and then select Align Pairwise from the context menu.
- In the ensuing dialog box, note that the default pairwise alignment method is Local: Smith-Waterman. Since this method is the best choice for finding a small segment of similarity within a larger sequence such as a chromosome, keep the default settings and click the OK button. (Note: If any long gaps were needed for alignment, you could always realign later using lower gap penalties.)
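To illustrate why a local method suits this task, here is a minimal Python sketch of Smith-Waterman alignment. It is not MegAlign Pro's implementation: the function name, the scoring values (match=2, mismatch=-1) and the simple linear gap penalty are assumptions chosen for illustration, and the sequences are toy data.

```python
def smith_waterman(ref, query, match=2, mismatch=-1, gap=-2):
    """Best local alignment of query against ref.

    Returns (score, ref_start, ref_end, aligned_ref, aligned_query),
    where ref_start/ref_end are 0-based, end-exclusive coordinates in ref.
    Scoring values are illustrative assumptions, not MegAlign Pro defaults.
    """
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]  # scoring matrix, floored at 0
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # match or mismatch
                          H[i - 1][j] + gap,      # gap in the reference
                          H[i][j - 1] + gap)      # gap in the query
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    a_ref, a_query = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if query[i - 1] == ref[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a_ref.append(ref[j - 1]); a_query.append(query[i - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a_ref.append("-"); a_query.append(query[i - 1])
            i -= 1
        else:
            a_ref.append(ref[j - 1]); a_query.append("-")
            j -= 1

    return (best, j, best_pos[1],
            "".join(reversed(a_ref)), "".join(reversed(a_query)))


# Toy example: a short query embedded in a longer "reference".
toy_ref = "TTTTACGTACGTTTTT"
toy_query = "ACGTACG"
score, start, end, aln_ref, aln_query = smith_waterman(toy_ref, toy_query)
print(score, start, end, aln_ref, aln_query)
```

Because every cell is floored at zero, the alignment can begin and end anywhere in the longer sequence, which is what allows a short gene to be located inside a chromosome. This quadratic-memory sketch is only practical for short sequences.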
This is a good time to discuss the concept of the "reference" (or "target") sequence versus the "query" sequence. In MegAlign Pro, these two sequences are defined through the two drop-down menus at the top of the view. The left menu should be used for the reference, and the right menu for the query sequence. Most of the time, the choice of which sequence to treat as the reference is somewhat arbitrary. However, when one sequence is much longer than the other, as in this example, the longer sequence should be used as the reference.
After the alignment has completed, the Pairwise view header indicates that the 706 bp query matches a 702 bp segment in the CT18 genome with 98.6% Identity and has 1 gap that is 4 bp long.
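The kind of summary shown in the header can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, the sequences are toy data, and identity is computed here as matching columns divided by total alignment columns, which is an assumption and may differ from the formula MegAlign Pro uses.

```python
import re


def alignment_stats(aligned_ref, aligned_query):
    """Summarize a gapped pairwise alignment given as two equal-length strings.

    Identity is defined here as matches / alignment columns (an assumption).
    """
    if len(aligned_ref) != len(aligned_query) or not aligned_ref:
        raise ValueError("aligned sequences must be non-empty and equal in length")

    columns = len(aligned_ref)
    matches = sum(1 for r, q in zip(aligned_ref, aligned_query)
                  if r == q and r != "-")

    return {
        "identity_pct": 100.0 * matches / columns,
        # Each run of '-' is one gap; its length is the gap size in bases.
        "ref_gap_lengths": [len(g) for g in re.findall(r"-+", aligned_ref)],
        "query_gap_lengths": [len(g) for g in re.findall(r"-+", aligned_query)],
    }


# Toy alignment with a single 4-base gap in the reference row.
stats = alignment_stats("ACGTAC----GTACGTACGT",
                        "ACGTACTTGAGTACGTACGT")
print(stats)
```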
- Scroll down through the alignment until you locate the gap, which begins at alignment position 452.
Notice that the gap is shown within the reference, indicating that there is a 4 bp insertion in the unknown sequence. Since the insertion length is not a multiple of three, the gap most likely represents a frame-shift in the unknown sequence.
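The frame-shift reasoning is simple arithmetic, sketched below in Python. The lengths are the ones reported in the Pairwise view header; the function name is hypothetical.

```python
def is_frameshift(indel_length):
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


query_length = 706        # bp, unknown sequence
ref_segment_length = 702  # bp, matching CT18 segment

insertion_length = query_length - ref_segment_length  # 4 bp
print(insertion_length, is_frameshift(insertion_length))
```

A 3 bp or 6 bp insertion would instead add whole codons and leave the downstream reading frame intact.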
- In the Tracks side panel on the right, locate the Pairwise Details section and check the box next to Translation.
- Returning your attention to the Pairwise view on the left, open the tracks for the unknown sequence by clicking the plus sign corresponding to that sequence.
The translation tracks (labeled 1, 2, 3) clearly indicate that the frame-shift has introduced an in-frame stop codon beginning at base 530 of the query sequence (red box in the image below). In other words, the frame-shift creates a premature stop, with the same truncating effect as a nonsense mutation. This is good evidence that the unknown sequence is indeed a mutant gene compared to the non-drug-resistant reference strain.
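What the translation tracks display can be approximated with a short Python sketch that translates a sequence in each of the three forward frames and reports the first stop codon. The standard genetic code is used; the function names and the toy sequences are hypothetical and are not taken from the tutorial data.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq, frame):
    """Translate seq in forward frame 0, 1 or 2; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(frame, len(seq) - 2, 3))


def first_stop(seq, frame):
    """1-based nucleotide position of the first stop codon in a frame, or None."""
    index = translate(seq, frame).find("*")
    return None if index == -1 else frame + 3 * index + 1


# Toy example: a 4-base insertion after the start codon shifts the frame
# and exposes a stop codon that was previously out of frame.
wild_type = "ATGGCTGATAAACGT"
mutant = "ATG" + "CCCT" + "GCTGATAAACGT"

for frame in range(3):
    print(frame + 1, translate(mutant, frame))

print(translate(wild_type, 0), first_stop(wild_type, 0))
print(translate(mutant, 0), first_stop(mutant, 0))
```

In the toy mutant, the stop appears downstream of the insertion rather than at it, mirroring the tutorial, where the gap begins at alignment position 452 but the stop codon begins at base 530 of the query.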
To find more clues to the identity of the unknown gene, you will next investigate the CT18 reference sequence annotations at this position.
- Collapse the tracks for the unknown sequence by clicking on any of the minus signs corresponding to that sequence.
- Expand the detail tracks for the reference (CT18) by clicking any of the corresponding plus signs for that sequence.
Observe that the indel’s position overlaps a gene and a CDS that are identified as STY0266.
- Click anywhere on the STY0266 feature to select it.
- Examine the annotations for STY0266 in the Details panel. If the Details panel is not visible, reveal it using View > Details.
As shown in the Details panel, the gene sequenced from the unknown, copper-resistant strain is a defective version of cutF (also called nlpE), which encodes a copper homeostasis protein that has been shown (Nishino et al., 2010) to be associated with elevated multidrug and copper resistance in E. coli.