Lasergene Genomics Suite now includes access to the Variant Annotation Database (VAD) for human sequencing data. I recently spoke with DNASTAR scientist Dr. Tim Durfee about the VAD to get a better understanding of how the tool works and how it can help genomics and clinical researchers with their variant analysis.
Can you describe what the Variant Annotation Database is?
The VAD is a database resource that contains information on individual positions and alleles across the human genome. It is currently human genome specific. The major purpose of the VAD is to allow rapid prioritizing and ranking of the large number of variants found in any given sample relative to the reference genome. This can be on the order of thousands of variants for gene panels; tens of thousands for exomes; and millions for whole genomes. This kind of large-scale analysis is critical for the clinical sequencing market.
How can users access the information in the VAD?
Annotation information for each called variant in a specific sample is automatically retrieved from the VAD during project setup in ArrayStar. With the upcoming Lasergene 14.0 release, the annotation information will be added to the project directly following assembly and variant calling. The data is accessible in the ArrayStar SNP table and can be used to filter variants and to create gene and SNP sets. For examples of how this can be done, take a look at our tutorial.
What is the source of the annotations in the VAD?
The data comes from two major sources: the 1000 Genomes Project and dbNSFP (Database of Human Nonsynonymous SNPs and their Functional Predictions). As the name suggests, the dbNSFP data covers protein-coding positions in the genome. The data is organized into five broad categories:
- Allele and genotype frequencies from the 1000 Genomes phase 3 data as well as from NHLBI’s Exome Sequencing Project. The 1000 Genomes data is available as global frequencies as well as frequencies for 26 populations grouped into 5 super populations. This data is extremely useful for filtering. For example, if you’re studying a rare disease that only occurs in a small number of individuals, you wouldn’t expect a relevant SNP to occur at high frequency in the population. Typically, you filter for variants with a population frequency below 5%, or even below 1%.
- Functional impact prediction methods: LRT, MutationTaster, PolyPhen-2 (two models) and SIFT. The four methods use different strategies to predict whether a given non-synonymous change is deleterious to the function of the encoded protein.
- Evolutionary conservation scoring systems: GERP++, SiPhy, PhyloP and PhastCons. These methods use sequence alignments of the human genome with the corresponding regions of other organisms to score how conserved each base is across evolution. In coding regions, the more evolutionarily conserved a base is, the more likely it is that the base at that position is important for the function of the encoded protein. Some methods (e.g. GERP++) can also be used to assess the importance of bases outside the coding regions.
- Pathogenicity information from ClinVar: ClinVar is a central repository hosted by NCBI that catalogs and reviews human variation and its connection to disease. The VAD uses the clinical significance field to allow filtering on different classifications including Benign and Pathogenic.
- Miscellaneous information: The VAD also contains other types of information, such as links to dbSNP, UniProt and InterPro, that allow the user to easily retrieve additional data from those resources.
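To make the frequency and ClinVar filtering described above more concrete, here is a minimal Python sketch. It is not how ArrayStar or the VAD works internally. The record type, the field names and the sample values are all hypothetical and exist only to illustrate the logic of keeping variants that are rare in the population and not classified as Benign.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedVariant:
    # Hypothetical fields for illustration; this is not the VAD's real schema.
    variant_id: str
    global_allele_freq: Optional[float]    # None when no population frequency is recorded
    clinical_significance: Optional[str]   # e.g. "Benign" or "Pathogenic"; None when absent


def is_rare_candidate(variant: AnnotatedVariant, max_freq: float = 0.01) -> bool:
    """Keep variants below the frequency cutoff that are not classified Benign."""
    freq = variant.global_allele_freq
    # Assumption: a variant with no recorded frequency is treated as rare.
    rare = freq is None or freq < max_freq
    not_benign = variant.clinical_significance != "Benign"
    return rare and not_benign


# Made-up example records.
variants = [
    AnnotatedVariant("var_a", 0.32, "Benign"),
    AnnotatedVariant("var_b", 0.004, "Pathogenic"),
    AnnotatedVariant("var_c", None, None),
    AnnotatedVariant("var_d", 0.03, None),
]

candidates_1pct = [v.variant_id for v in variants if is_rare_candidate(v, 0.01)]
candidates_5pct = [v.variant_id for v in variants if is_rare_candidate(v, 0.05)]

print(candidates_1pct)
print(candidates_5pct)
```

Treating a missing frequency as "rare" is a design choice in this sketch. A stricter analysis might instead set such variants aside for manual review.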
What are the advantages of using the VAD over a user’s own database or VCF file?
If a user has huge VCF files with the annotations, they would have to manually go through each position and retrieve the relevant information for that allele. With the VAD, all the annotations are automatically retrieved and readily available for filtering. The VCF is more useful as a record file of all the variants and their annotations that can be shared between applications. For example, a VCF of alleles of interest produced by ArrayStar can be used by SeqMan NGen in subsequent assemblies to report on those positions.
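For a sense of what going through a VCF by hand involves, the sketch below parses a single VCF data line in plain Python and pulls one annotation out of the INFO column. It assumes the standard tab-separated VCF layout (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and the reserved `AF` and `DB` INFO keys. The example record is made up, and the helper names are hypothetical rather than part of any DNASTAR tool.

```python
def parse_info(info_field):
    """Split a VCF INFO column into a dict; flags without '=' map to True."""
    info = {}
    if info_field == ".":          # "." means the INFO column is empty
        return info
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line."""
    chrom, pos, var_id, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),     # ALT may list several alleles
        "info": parse_info(info),
    }


# Made-up record for illustration only.
example_line = "1\t12345\t.\tA\tG\t50\tPASS\tAF=0.004;DB"
record = parse_vcf_line(example_line)
allele_freq = float(record["info"]["AF"])

print(record["chrom"], record["pos"], record["ref"], record["alt"], allele_freq)
```

Repeating this for every annotation of interest across millions of positions is the manual work that automatic retrieval is meant to remove.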
How does this compare to other tools on the market today?
The major advantage of DNASTAR’s Variant Annotation Database is the seamless connection with the assembly and variant caller. With open source software, you have to first run the assembly, do the variant calling with a separate program, and then use yet another tool to add the annotation information. There is often a steep learning curve with each of these tools, which can make the overall process laborious. The DNASTAR pipeline integrates all these steps into one suite and allows for multiple sample comparison and filtering.
Want to learn more? Check out our variant analysis workflow page to see videos and benchmarks on NGS assembly and variant analysis in Lasergene Genomics Suite.