How to use VCF files for variant analysis and genotyping
Annotated data is essential for assessing the importance of variants to a given trait or disease under investigation. This is true whether you are looking for known variants in a single sample or comparing multiple samples for shared variants or affected genes. Variant call format (VCF) files provide a compact, human-readable method of storing variant information from one or more samples that share the same reference sequence. Unfortunately, they do not include important functional information about individual variants, such as their impact on a gene. For example, VCF files do not indicate whether a SNP changes the amino acid sequence of a protein-encoding gene. This information is critical in assessing the potential impact of a sequence variant.
In order to provide basic functional annotation and enriched information for human samples, DNASTAR has developed a new workflow: Variant Annotation in Lasergene Genomics. This workflow allows users with human-based VCF files to annotate them using DNASTAR’s Variant Annotation Database (VAD).
What are Variant Call Format files, and where do they come from?
VCF files are produced at the end of a variant-calling pipeline. Each new sample is sequenced, its reads are aligned to a reference genome by an aligner such as BWA, and variants are then called from that alignment by a tool such as GATK.
VCF files often originate from the alignment of Illumina data, but some sources (e.g., Genome in a Bottle) combine data from multiple sequencing technologies to make higher-confidence calls.
VCF files provide a convenient way to archive and share basic SNP and small indel information such as the variant base(s) and the chromosome and position where the variant is located. These files are especially useful for large scale data involving a whole genome.
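To make the "chromosome, position, and variant base(s)" description concrete, here is a minimal sketch of reading those fields from one tab-delimited VCF data line. The function and class names are hypothetical, the example records are illustrative only, and the sketch ignores the header, the INFO field, and symbolic alleles such as `<DEL>`.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str   # chromosome / contig name
    pos: int     # 1-based position on the reference
    ref: str     # reference base(s)
    alts: list   # one or more alternate alleles


def parse_vcf_line(line: str) -> Variant:
    """Read the first five fixed columns of a VCF data line."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    # Multiple alternate alleles are comma-separated in the ALT column.
    return Variant(chrom, int(pos), ref, alt.split(","))


def classify_allele(ref: str, alt: str) -> str:
    """Rough size-based label for one REF/ALT pair (sketch only)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "multi-base substitution"


# Illustrative records, not real calls from any sample.
for record in ("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14",
               "20\t1234567\tmicrosat1\tGTC\tG,GTCT\t50\tPASS\t."):
    v = parse_vcf_line(record)
    for alt in v.alts:
        print(v.chrom, v.pos, v.ref, alt, classify_allele(v.ref, alt))
```

Note that nothing in these columns says what the change does to a gene; that is the gap the annotation step fills.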
Multi-sample VCFs, such as those available from the 1000 Genomes project, contain variant calls from multiple individuals. These samples were originally sequenced independently, then later combined into a single file. Because the file is a merged view across samples, it also records positions where one or more samples do not have a variant. In other words, only one sample needs to have a variant at a position for that position to be included in the file. This is the same idea we use in ArrayStar to provide information on reference calls across a population of samples.
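The following sketch illustrates how a single row of a multi-sample VCF can hold a variant call for one sample and a reference call for another. It assumes the standard layout in which a FORMAT column (column 9) is followed by one column per sample and the genotype is stored under the `GT` key. The function names and the example row are hypothetical.

```python
def sample_genotypes(header_line: str, data_line: str) -> dict:
    """Map each sample name to its raw GT string for one VCF row."""
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = data_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")
    return {name: entry.split(":")[gt_index]
            for name, entry in zip(samples, fields[9:])}


def describe_call(gt: str) -> str:
    """Label a genotype as reference, variant, or no call."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "no call"
    if all(a == "0" for a in alleles):
        return "reference"
    return "variant"


# Illustrative header and row; sample names and values are made up.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
row = "20\t14370\t.\tG\tA\t29\tPASS\t.\tGT:DP\t0|0:1\t1|0:8\t./.:."

for name, gt in sample_genotypes(header, row).items():
    print(name, gt, describe_call(gt))
```

Here only the second sample carries the variant, yet the position appears in the file for all three.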
How does the variant call format annotation process work?
Users load their VCF file(s) into SeqMan NGen using the Variant Call Format (VCF) analysis workflow (see image to right). The software then annotates variants via a two-step process. The first annotation step classifies variants by their effect on coding regions, relative to the imported reference genome. The second annotation step includes import of the DNASTAR Variant Annotation Database (VAD), which combines data from a variety of SNP level annotation databases.
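As a rough illustration of what "classifying a variant by its effect on a coding region" means, the sketch below labels a SNP as synonymous, missense, or nonsense using the standard genetic code. It is not DNASTAR's implementation. It assumes a hypothetical input in which the coding sequence is already spliced, in frame, and on the forward strand, and the SNP is given as a 0-based offset into that sequence.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_coding_snp(cds: str, offset: int, alt: str) -> str:
    """Classify a single-base change at a 0-based offset within a CDS."""
    start = offset - offset % 3          # first base of the affected codon
    ref_codon = cds[start:start + 3]
    i = offset % 3                       # position of the SNP in the codon
    alt_codon = ref_codon[:i] + alt + ref_codon[i + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-lost"
    return "missense"


# Toy coding sequence: ATG TGG CGA TAA (Met, Trp, Arg, stop).
toy_cds = "ATGTGGCGATAA"
print(classify_coding_snp(toy_cds, 8, "G"))  # CGA -> CGG
print(classify_coding_snp(toy_cds, 3, "C"))  # TGG -> CGG
print(classify_coding_snp(toy_cds, 5, "A"))  # TGG -> TGA
```

A real annotator also has to handle introns, reverse-strand genes, indels, and multiple transcripts, which is why the annotated reference genome is required.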
The image below shows the steps involved in the VCF annotation workflow.
After following the VCF Annotation workflow in SeqMan NGen, where is downstream analysis performed?
The workflow produces a faux assembly with annotated variants. These variants can be viewed in SeqMan Pro or SeqMan Ultra. However, they are more commonly viewed in ArrayStar, where all the standard variant comparison tools and views are available.
In ArrayStar, users can filter for genes and/or variants of interest across multiple samples and use the many cross-comparison tools. The “Add/Manage Columns” tool lets users add useful data columns from the Variant Annotation Database to any of the ArrayStar tables (see image on right).
How does the Variant Annotation workflow differ from simply adding an uploaded VCF file to any reference-guided workflow?
In SeqMan NGen’s reference-guided workflows, uploaded VCF files are used to tag positions of interest in a new NGS assembly. These positions are assigned a user ID that can be used in filtering in SeqMan Pro, SeqMan Ultra, or ArrayStar.
By contrast, the VCF Annotation workflow annotates specified VCF files using the corresponding annotated reference sequence without the use of NGS read data.
It is important to note that the reference sequence used MUST be in the same coordinate system as the one used to generate the VCF; otherwise the results will be erroneous. For example, a VCF generated with build 37 of the human genome must be annotated against the annotated build 37 human reference, not the newer build 38 sequence.
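One way to sanity-check the build before annotating is to inspect the `##contig` lines that many variant callers write into the VCF header. The sketch below is not part of the DNASTAR workflow, and it only works when the header includes contig lengths. It compares the recorded length of chromosome 1 against the known chromosome 1 lengths for builds 37 and 38. The function names are hypothetical, and the comma-splitting is naive (it would mis-handle quoted values that contain commas).

```python
import re

# Length of human chromosome 1 in each build.
CHR1_LENGTHS = {249250621: "GRCh37/hg19", 248956422: "GRCh38/hg38"}

CONTIG_RE = re.compile(r"^##contig=<(.*)>$")


def contig_lengths(header_lines) -> dict:
    """Collect {contig ID: length} from ##contig header lines."""
    lengths = {}
    for line in header_lines:
        match = CONTIG_RE.match(line.strip())
        if not match:
            continue
        # Naive key=value split; adequate for simple ID/length entries.
        attrs = dict(kv.split("=", 1)
                     for kv in match.group(1).split(",") if "=" in kv)
        if "ID" in attrs and "length" in attrs:
            lengths[attrs["ID"]] = int(attrs["length"])
    return lengths


def guess_build(header_lines):
    """Return a build label based on chromosome 1 length, or None."""
    lengths = contig_lengths(header_lines)
    for name in ("1", "chr1"):
        if name in lengths:
            return CHR1_LENGTHS.get(lengths[name])
    return None


# Illustrative header lines.
example_header = ["##fileformat=VCFv4.2",
                  "##contig=<ID=1,length=249250621>"]
print(guess_build(example_header))
```

If the function returns `None`, the header does not settle the question, and the build should be confirmed with whoever produced the file.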
What type of researcher could benefit from the VCF Annotation workflow?
This workflow is designed to help scientists researching variants that may be involved in a particular human disease or trait. For example, in a recent poster, we used this workflow to analyze 96 targeted resequencing samples from a Chinese cohort with lung squamous cell carcinomas (LSCC, Li et al., Sci Rep. 2015. 5:14237). We were able to easily identify unique variants in numerous samples across the cohort, all of which cause nonsense or frameshift mutations in the TP53 tumor suppressor gene.
Some researchers may use the workflow to update their own sample annotations with a newer version of the annotated reference genome. Others, including those who do not normally use SeqMan NGen, can also benefit from this workflow. A common scenario is to collect VCF files from colleagues or public resources and annotate them for the purposes described above.