Creating the curated feature library for accurate vector auto-annotation


Lasergene Molecular Biology Suite now includes plasmid auto-annotation functionality. I recently spoke with DNASTAR scientist Dr. Guy Plunkett III about this new capability to better understand how the tool works and why a curated feature library really makes a difference.
1. Tell me a little about DNASTAR’s new auto-annotation capability.
In our newest release, Lasergene 14.1, we’ve added the ability in SeqBuilder to annotate plasmids automatically using a carefully curated database of features, which allows us to do it extremely accurately. SeqBuilder takes the sequence or sequences to annotate, searches the curated database, and reports back a list of matches; we allow the user to control the stringency used to decide what counts as a match. The user then has the option to either accept all of the matches or review them individually. We give the option to ignore features that aren’t wanted and, best of all, users can easily replace existing features with a more current and accurate version from the feature library.
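To make the idea of match stringency concrete, here is a minimal Python sketch of searching a sequence against a small feature library. It is an illustration only, not SeqBuilder's algorithm: the function names, the ungapped sliding-window comparison, the `min_identity` parameter, and the toy feature are all assumptions made for this example, and the sequence is treated as linear rather than circular.

```python
from typing import Dict, List, NamedTuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


class Match(NamedTuple):
    feature: str     # name of the library feature
    start: int       # 0-based start on the forward strand
    strand: str      # "+" or "-"
    identity: float  # fraction of identical positions


def fraction_identical(a: str, b: str) -> float:
    """Ungapped identity between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def annotate(plasmid: str, library: Dict[str, str],
             min_identity: float = 0.95) -> List[Match]:
    """Report every window that matches a library feature on either
    strand at or above the chosen stringency (hypothetical helper)."""
    plasmid = plasmid.upper()
    matches = []
    for name, feature in library.items():
        feature = feature.upper()
        for strand, query in (("+", feature),
                              ("-", reverse_complement(feature))):
            for start in range(len(plasmid) - len(query) + 1):
                window = plasmid[start:start + len(query)]
                score = fraction_identical(window, query)
                if score >= min_identity:
                    matches.append(Match(name, start, strand, score))
    return matches


# Toy data, invented for illustration.
library = {"toy_feature": "ATGGCCAAGT"}
exact = "TTTT" + "ATGGCCAAGT" + "CCCC"
one_mismatch = "TTTT" + "ATGGCCTAGT" + "CCCC"

strict_hits = annotate(one_mismatch, library)                     # default stringency
relaxed_hits = annotate(one_mismatch, library, min_identity=0.9)  # looser stringency
```

In this sketch, lowering `min_identity` lets the one-mismatch copy through, which is the kind of trade-off a user-controlled stringency setting exposes. A real tool would also need gapped alignment and handling of features that span the origin of a circular plasmid.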
2. Why is it so important to have a curated feature library?
The bottom line with annotation is garbage in equals garbage out, and for any annotation tool or pipeline to be worthwhile, it needs accurate underlying data. There are far too many examples of poor annotations being propagated to other sequences because they were the “best match” to a new unannotated sequence. A couple of cycles like that and the result could be worse than just useless — it could lead a user to wrong conclusions and/or failed experiments. A selection or counter-selection during cloning might fail because the functional sequence was not actually present in a vector. Or a function that could disrupt an experiment might be present, but not included in the annotations. Years ago I ran into a situation where all attempts to clone a particular sequence failed, and it eventually turned out that the host/vector combination I was using had an undocumented restriction system that was very active against the DNA I was trying to clone. The information was “published” in a vendor’s newsletter, but never made it into the strain documentation. These days when you get a new vector to use, there might be a sequence available with little or no annotation — just look at some vector sequences in GenBank! Our goal was to develop a tool that would allow you to know more about a vector, and ascertain up front whether it was even a good choice for your intended use.
3. What process did you follow to create the curated library of features?
I started the creation of a feature library by harvesting all of the features annotated in our current Cloning Vectors catalog as individual sequences. That collection of sequences was then the starting point for a process of checking, validating, and curating. Feature sequences that, based upon their descriptions, were supposed to be the same thing were pooled and compared. For example, a beta-lactamase gene might instead be labelled as penicillinase, AmpR, ampicillin, amp, or bla – all ways of referring to the same sequence. Things that obviously were not what they claimed to be were put into a different pool, and in some cases eventually discarded as uninformative. Remaining “equivalent” sequences were compared using Lasergene’s alignment tools for pairwise and multi-sequence alignments as necessary.
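The pooling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synonym table holds just the beta-lactamase labels mentioned in the answer, the choice of `bla` as the canonical name is arbitrary, and the helper names and example records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Labels from the interview that all refer to the same gene. Using "bla"
# as the canonical name is an assumption made for this example.
SYNONYMS = {
    "bla": {"beta-lactamase", "penicillinase", "ampr",
            "ampicillin", "amp", "bla"},
}


def canonical_name(label: str) -> str:
    """Map a free-text feature label to a canonical pool name."""
    key = label.strip().lower()
    for canonical, aliases in SYNONYMS.items():
        if key in aliases:
            return canonical
    return key  # unknown labels form their own pool


def pool_features(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (label, sequence) records so that sequences claiming to be
    the same feature can be compared side by side."""
    pools = defaultdict(list)
    for label, seq in records:
        pools[canonical_name(label)].append(seq.upper())
    return dict(pools)


# Invented records; the sequences are placeholders, not real genes.
records = [
    ("AmpR", "ATGA"),
    ("bla", "atga"),
    ("penicillinase", "ATGC"),
    ("KanR", "GGG"),
]
pools = pool_features(records)
```

Each pool is then a candidate set for pairwise or multiple alignment. Members that turn out not to be what their label claims would be moved out by hand, as described above.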
4. What problems did you encounter?
I found several issues, such as CDSs (protein-coding sequences) that were missing start or stop codons, whose length was not a multiple of 3 (i.e., they did not contain a whole number of intact triplets), or that were in the wrong reading frame. Some sequences were annotated on the wrong strand. I even found some purported drug resistance genes that were actually totally unrelated sequences. Part of this work was done using SeqBuilder and other Lasergene applications, and BLAST searches were sometimes required to resolve issues.
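Several of these CDS problems can be detected mechanically. The sketch below shows one way to do that in Python. It is not the procedure used to build the library: it assumes the standard genetic code's stop codons and, by default, only ATG as a start codon (some bacterial genes use GTG or TTG), and the function names are hypothetical.

```python
from typing import FrozenSet, List

STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})  # standard genetic code
DEFAULT_STARTS = frozenset({"ATG"})             # assumption; see lead-in
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def check_cds(seq: str,
              start_codons: FrozenSet[str] = DEFAULT_STARTS) -> List[str]:
    """Return a list of problems found in a putative CDS (empty if none)."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if seq[:3] not in start_codons:
        problems.append("missing start codon")
    if seq[-3:] not in STOP_CODONS:
        problems.append("missing stop codon")
    # Codons read in frame from the first base, excluding the last three bases.
    internal = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    if any(codon in STOP_CODONS for codon in internal):
        problems.append("in-frame stop codon before the end")
    return problems


def looks_reverse_stranded(seq: str) -> bool:
    """True if the sequence fails the checks as given but passes them
    once reverse-complemented, hinting at a wrong-strand annotation."""
    return bool(check_cds(seq)) and not check_cds(reverse_complement(seq))


# Toy sequences, invented for illustration.
ok = check_cds("ATGGCCTAA")
truncated = check_cds("ATGGCCTA")
no_start = check_cds("GCCGCCTAA")
early_stop = check_cds("ATGTAAGCCTAA")
```

An in-frame stop codon before the end is often the visible symptom of a wrong reading frame. Checks like these only flag candidates, though; deciding what a sequence really is still takes alignment, BLAST searches, and the literature.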
5. How did you overcome them?
Throughout the process of building the feature library, I went to the literature any time I had a question, when existing annotations didn’t agree, or when something just didn’t make sense to me. Outside of protein-coding sequences, it was often not immediately obvious whether sequence differences in a set for the “same” feature were relevant or not. For example, given three sequences that overlap one another but differ, in length or sequence, beyond the bounds of a shared core sequence, I might infer that the overlap region is the important part. So the next step was to go to the literature: PubMed, Google Scholar, printed work, company catalogs, anything I could access, and as close to the original source as possible.
6. Do you have any examples where the annotation improves the end product?
A number of eukaryotic vectors contain the ADH1 promoter, the promoter of the alcohol dehydrogenase I gene of the yeast Saccharomyces cerevisiae. The initially characterized promoter is a sequence of ~1500 bp, but most vector constructs contain truncated versions of ~700 bp or ~400 bp. These versions overlap, but they contain additional material upstream or downstream. By going to the literature, I was able to ascertain that there are three functionally different versions of this promoter, which are included in our Feature Library:


  • p_ADH1a (1503 bp) full-length promoter of alcohol dehydrogenase I from S. cerevisiae; ethanol-repressed, high expression


  • p_ADH1b (705 bp) deregulated promoter of alcohol dehydrogenase I from S. cerevisiae; constitutive high expression


  • p_ADH1c (397 bp) truncated promoter of alcohol dehydrogenase I from S. cerevisiae; some vectors contain an adjacent region from pBR322 that has upstream activation sequence (UAS) activity in yeast, and those truncated versions provide constitutive, medium expression; in the absence of that fortuitous enhancer, this version provides only weak constitutive expression


Examination of competitors’ feature data sets revealed the presence of variations on both the ~700 bp and ~400 bp versions, but without indication that there is any functional difference. Using a vector annotated from such a source could lead to unexpected experimental results.
7. What is your background in this area and why are you qualified to do this work?
I have a background suited for this work — not with vector annotation per se, but with microbial genome annotation and the annotation process in general. I have over a quarter century of annotation experience including over 500 annotated sequences deposited in GenBank, and have done annotation work with academic colleagues, under government contract, and with commercial entities. This experience allows me to examine annotations and make sense of them even if I am not familiar with the source organism.
Want to learn more? Check out our plasmid auto-annotation page to see a video of the workflow in action.