The innovation of next-generation sequencing (NGS) technologies has enabled exponential growth of the production of high throughput omics data which is widely analyzed for identification of genomic variants including single nucleotide polymorphisms (SNPs) and DNA insertions and deletions (indels) in a spectrum of genetic-related disorders, and provide new insights into how genetic polymorphisms affect disease phenotypes. To facilitate this research, the T-Bioinfo platform hosts the variant calling pipeline that has been developed to enable researchers to accurately and rapidly identify, and annotate, sequence variants. Variant calling refers to the task of identifying possible variations in genome or transcriptome sequences with respect to a chosen reference sequence. In germline variant calling, the reference sequence is the standard for the species of interest. For somatic variant calling, the reference is the genome of a chosen control somatic cell sample.
The variant calling pipeline identifies single nucleotide variants present within the whole genome and exome data. The variants are identified by comparing the datasets of an individual with a reference sequence.
Now let's have a look at an example Variant Calling: Somatic & Germline pipeline on the T-Bioinfo Server: https://server.t-bio.info/pipelinesmutationvariant/demopipelines/breast-cancer-mutation-oncogene-variant-caling-tp53 and learn about the types of input files that should be uploaded, parameters chosen to run the pipeline, processing pipeline and finally what the output files look like.
Input files required for processing the pipeline
During genomic sequencing, DNA molecules are fragmented, and sequenced individually. Paired-end sequencing is a method in which reads are made from both ends of each fragment, towards the center of the fragment. This differs from single-end sequencing in which each fragment is only read from one end towards the other. Paired-end sequencing provides higher quality sequence data, allowing the identification of genomic rearrangements, and higher confidence in alignments. The paired-end read data is in FASTQ file format, which is designed for storing biological sequence data, along with data quality scores. A level of uncertainty exists about the type of each base pair (A, C, T, or G) during the sequencing process.
To run the Variant Calling pipeline, following are the optional parameters and type of input files that could be uploaded.
- Type of Reference Genome: RModelGenomeGTF, RGenome, RGenomeGTF
- Format of NGS Data: fastQ, fasta, SAM_BAM, VCF
- Single or Pair end reads: SE, PE
- Organism: To select reference genome
The data files we will use were taken from Sequence Read Archive (SRA), a repository for biological sequence and alignment data & the Metadata for each sequencing run is available at ArrayExpress. For subsequent analysis, it is valuable to store a quality measure for each base pair call.
To upload the input files, a user can upload the input file to run the pipeline in various formats as mentioned below:
- The “txt” files can be uploaded directly under “Upload Files” option, or
- Files could be uploaded from a “Link”, or
- We can also upload the “NCBI Run Table” file, or
- Upload “.txt”or”.svl” file to bulk import URLs
Pipeline on the T-Bioinfo Server
To run the pipeline we need to follow the following workflow:
Start > Bowtie2 > Visualization > FreeBayes > End
The variant calling pipeline consists of a series of interlinked sequential steps. Let's understand the functionality of each step in the pipeline.
Start: The mutation variant section provides various methods for processing next-generation sequencing data to detect genomic variants between groups or samples.
Bowtie2: Since our exome-seq data is in the form of segmented DNA sequences, the first step in identifying mutations is to match the reads (segments) to a reference sequence. This will allow us to highlight differences between the reference, which acts as a baseline, and the read sequences. This process is called sequence alignment, or mapping, and many approaches have been developed. Reference genomes have been created for a number of model species, including humans. These reference genomes are compiled from several individuals, and incorporate small variations from each.
Bowtie2 is a fast alignment algorithm based on the FM-index approach on seed substrings to align them to the genome. For more info please see: Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357-359. doi:10.1038/nmeth.1923.
This job takes reads in fq, fa files, aligns these onto the reference genome and gives mapping results in SAM format. Each row of the SAM file contains input read name, reference read name, position on the reference read and number of mapped/skipped/inserted/deleted positions
Visualization: In JBrowse Visualization, users can view genome annotations against a reference “ruler,” with an overhead bar giving a visual indication of chromosome position. The user can navigate by dragging the display left or right or by clicking on navigation buttons; analogous buttons allow the display to be zoomed in or out. Alternatively, users can navigate directly to a region (or feature) of interest by typing the region coordinates (or feature name) into a search box. Additional annotation tracks can be added to the current display by dragging them from a reservoir bar on the left of the screen, and can be removed by dragging them back off the main display. Tracks can, similarly, be reordered by dragging. All track manipulation, as with navigation, is live and requires no page reloads. Reference: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0924-1
Following alignment to the human reference genome, we are interested in identifying genetic variants in the cancerous cell lines, compared to the reference normal cell lines. At this point in the analysis, variant calling methods are employed. Some differences from the reference sequence are generally present in aligned read sequences, due to uncertainty in the sequencing and alignment processes. Variant calling algorithms search for significantly frequent variants in the aligned read data. Once again, many computational approaches have been developed. Some are specialized for identifying certain types of mutations. On T-Bioinfo Server, the most widely used variant callers are Freebayes and Mutect2
FreeBayes: FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.
FreeBayes is haplotype-based so that it calls variants based on the literal sequences of reads aligned to a particular target, not their precise alignment. This method avoids one of the core problems with alignment-based variant detection--- that identical sequences may have multiple possible alignments: FreeBayes uses short-read alignments (BAM files) for any number of individuals from a population and a reference genome (in FASTA format) to determine the most-likely combination of genotypes for the population at each position in the reference. It reports positions which it finds putatively polymorphic in variant call file (VCF) format.
For more information please see: Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012
Mutect2: MuTect2 is a method developed at the Broad Institute for the reliable and accurate identification of somatic point mutations in next generation sequencing data of cancer genomes. Mutect attempts to call mutations; it also generates a coverage file (in a wiggle file format, which indicates for every base whether it is sufficiently covered in the tumor and normal to be sensitive enough to call mutations). We currently use cutoffs of at least 14 reads in the tumor and at least 8 in the normal (these cutoffs are applied after removing noisy reads in the preprocessing step). In addition, wiggle files can also be generated of the observed depth in the tumor and in the normal.
Most cancer genome studies at the Broad Institute have made use of MuTect and have validated the mutation calls as a part of their cancer biology papers, showing that MuTect has a very low false positive rate.
Output Files: Obtained when pipeline processing is complete
After the pipeline has completed its processing, you will obtain a list of output files that could be downloaded to carry out statistical analysis and interpret biological insights. You will also obtain data visualizations in your output files that make sense to understand meaningful patterns or significant results.
VCF File: VCF (Variant Call Format) is a text file format generated during the variant calling process that contains genomic information and locations of variants in a group of sequenced samples. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. There is an option whether to contain genotype information on samples for each position or not.
Mapping Stats Table:
Mapping refers to assigning/locating a specific gene to a particular region of a chromosome and determining the location of and relative distances between genes on the chromosome.
Mapping Stats table highlights the “Overall alignment rate” of the reads on the reference genome. Anything above 90% refers to a good alignment rate and can be considered for looking at the variants.
Now let's look at the tracks on the right side of the screen. The first few entries refer to the reads of your SRA Run Table. Next you can see three tracks which are “GRCh38NoPatch.gtf.sorted.gff”, “tp53_free_bayes_freebayes.vcf” and “Reference sequence (GRCh38NoPatch)”. So what are these tracks all about ? First of all let's understand what a track is.
Tracks: A track is the place to display your data files in JBrowse Linear Genome View. For this project we're going to add a gene track, an alignments track, and a variants track. We're just going to use basic configuration.
Reference sequence (GRCh38NoPatch)
This track includes the GRCh38 reference sequence on which the reads are aligned for variant calling.
GRCh38NoPatch.gtf.sorted.gff (Gene Track)
Many bioinformatics programs represent genes and transcripts in GFF format (General Feature Format) which simply describes the locations and the attributes of gene and transcript features on the genome (chromosome or scaffolds/contigs).
Tp53_free_bayes_freebayes.vcf (Variation and Coverage)
JBrowse highlights which reads have mapping issues, and any changes in bases between the reads and the genome and “auto generates a SNP track”, which produces an extra track we can enable in JBrowse. This track reads the same VCF file used for visualizing reads, and then produces a SNP and coverage visualization.
To learn more about each section & get a practical hands on experience, get started with “Genomics” coursework on the OmicsLogic Learn Portal.
Link to the Course: https://learn.omicslogic.com/courses/course/course-3-genomics
For any questions, you can reach out to us at firstname.lastname@example.org or email@example.com