When chaining multiple calculations, take care with how intermediate files are passed from one step to the next. If the input file for one step is the output file from an earlier step, you may obtain unexpected results. Specifically, the final output may contain additional sequences and/or features.
Script A is an example of a workflow that can lead to such unexpected results:
- Step 1: Extract features from TEST.GBK and save them as EXTRACTED.GBK.
- Step 2: Translate EXTRACTED.GBK and save the results as TRANSLATED.GBK.
- Result: The file TRANSLATED.GBK may contain unexpected data (e.g., an extra translated feature).
This situation arises when there are overlapping CDS features. In such cases, when one CDS is extracted, the piece of the second CDS that lies in the shared interval remains annotated on the extracted sequence. Then, when the EXTRACTED.GBK file is translated, that sequence yields two protein sequences: the desired full-length CDS and the fragment of the overlapping CDS. Note that this is not an issue when the intermediate file is in FASTA format, since FASTA carries over no information about the overlapping CDS.
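The carryover described above can be illustrated with a minimal Python sketch. This is only a model of the described behavior, not SeqNinja code or its actual implementation; the `Feature` type, the `carried_over` function, and the coordinates are hypothetical and chosen purely for illustration.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """Hypothetical stand-in for a CDS annotation (not a SeqNinja type)."""
    name: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def carried_over(target: Feature, features: List[Feature]) -> List[Feature]:
    """Model which annotations land on the sequence cut out at `target`.

    Every feature that intersects the target interval is clipped to that
    interval and re-expressed in coordinates relative to the target start.
    """
    result = []
    for f in features:
        lo = max(f.start, target.start)
        hi = min(f.end, target.end)
        if lo < hi:  # non-empty intersection with the target interval
            result.append(Feature(f.name, lo - target.start, hi - target.start))
    return result


# Made-up example: cdsA and cdsB overlap by 50 bases; cdsC stands alone.
features = [
    Feature("cdsA", 100, 400),
    Feature("cdsB", 350, 700),
    Feature("cdsC", 900, 1200),
]

for f in carried_over(features[0], features):
    print(f"{f.name}: {f.start}-{f.end}")
# cdsA: 0-300
# cdsB: 250-300
```

In this model, the record extracted for cdsA carries two annotations, so a later translation step would see both the full-length cdsA and a 50-base fragment of cdsB. A FASTA intermediate would keep only the sequence, leaving nothing extra to translate.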
If you need the extracted CDS sequences to be in GenBank format instead of FASTA format, you should instead use the procedure described in Script B:
- Step 1: In the SeqNinja shell, enter the following:
:: extracted_CDS.fas = extract(genome.gbk,'CDS')
- Step 2: Leave the SeqNinja shell open. In any text editor, open the resulting .starff file, delete the extraneous annotation lines, and save the changes.
- Step 3: In the SeqNinja shell, enter the following:
:: extracted_CDS.gbk = extracted_CDS.fas
:: translated_CDS.gbk = translate(extracted_CDS.gbk)
- Result: Overlapping fragments of CDS features are absent from the GenBank-formatted output file, which is the desired outcome.