16S RNA Amplicon Data Analysis (DADA2) on T-BioinfoServer

The pipeline is based on running a number of programs, including DADA2, Ape, and Phyloseq algorithms. DADA2 generates amplicon sequence variant (ASV) tables, which are similar to OTU tables but detailed in that they tabulate the number of identical amplicon sequence variants from different samples. Microbial studies utilizing DADA2 provide high resolution accurately reconstructed amplicon sequences that improve the detection of sample diversity and biological variants. 

Now let's have a look at an example Metagenomics pipeline on the T-Bioinfo Server: https://server.t-bio.info/pipelinesmetagenomicsqiime2/demopipelines/demo-chow-vs-hfd-mice and learn about the types of input files that should be uploaded, parameters chosen to run the pipeline, processing pipeline and finally what the output files look like. 

Input files required for processing the pipeline

To run the 16S RNA Amplicon pipeline, following are the optional parameters and type of input files that could be uploaded. 

  • Type of Reference Genome: Local, UserUpload
  • Format of NGS Data: fastA, fastQ
  • Single or Pair end reads: SE, PE
  • Databases: 16sRNA, VirusGenomes

To upload the input files, a user can upload the input file to run the pipeline in various formats as mentioned below:

  • The “txt” files can be uploaded directly under “Upload Files” option, or
  • Files could be uploaded from a “Link”, or
  • We can also upload the “NCBI Run Table” file, or
  • Upload “.txt”or”.svl” file to bulk import URLs

Add the supplementary file at the next stage and click on submit to run the pipeline

Pipeline on the T-Bioinfo Server

To run the pipeline we need to follow the following workflow:

Start > QC Filtering > Replication Count > Pair Merge > Cluster Consensus (OTU) > Remove Chimers > AssignTaxon > APE > Phyloseq > Data Visualization > End

DADA2 on T-Bioinfo Server

Lets now understand the functionality of each step in the pipeline. 

BEGIN: DADA2, a software package that models and corrects Illumina-sequencing amplicon errors. DADA2 infers sample sequences exactly, without coarse-graining into OTUs, and resolves differences of as little as one nucleotide. In several mock communities DADA2 identified more real variants and output fewer spurious sequences than other methods.  

This section provides a full sequence of methods to analyze 16s data and get visual outputs that help interpret.

DADA2: DADA - the Divisive Amplicon Denoising Algorithm - was introduced to correct pyrosequenced amplicon errors without constructing OTUs [7]. DADA was shown to identify real variation at the finest scales in 454-sequencing amplicon data while outputting few false positives.

Qc Filtering: DADA2 is a software package for analysis of pair-end metagenomics sequencing reads that was developed for merging reads, de-noising them and accurately combining them into OTUs. The first step is to filter reads. QC Filtering looks at the quality of reads at each nucleotide to determine a cut-off point for reads to consider.

Filtering of fastq files is a function that trims  sequences  to  a  specified  length, removes  sequences  shorter  than  that  length,  and  filters based  on the number of ambiguous bases, a minimum quality score, and the expected errors in a read. This can be done separately for the forward and reverse reads or jointly for both reads:

The DADA2 algorithm makes use of a parametric error model that is derived from each dataset. The algorithm alternates estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. This process begins with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors). 

DADA2 implements a new quality-aware model of Illumina amplicon errors. Sample composition is inferred by dividing amplicon reads into partitions consistent with the error model. You can read more about these steps in a detailed tutorial: https://benjjneb.github.io/dada2/tutorial.html or in the publication 

Replication Count: After reads are analyzed for quality and are trimmed in the same way, we need to eliminate reads that do not have a matched pair. This method  outputs a dereplicated list of  unique  sequences  and their abundances as well as consensus positional quality scores for each unique sequence by taking the average (mean) of the positional qualities of the component reads. DADA2 denoising algorithm uses the empirical  relationship  between  the  quality  score  and  the  error  rates. When reads are merged, this relationship will differ between the  forward-only,  overlapping,  and  reverse-only  portions  of  the merged  read.  That  variation  interferes  with  the  denoising  algorithm,  and  therefore  greater  accuracy  can  be  achieved  by denoising before merging.

Pair Merge: Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region. This function attempts to merge each denoised pair of forward and reverse reads, rejecting any pairs which do not sufficiently overlap or which contain too many (>0 by default) mismatches in the overlap region. Note: This function assumes that the fastq files for the forward and reverse reads were in the same order. 

Cluster Consensus (OTU): DADA2 Cluster Consensus constructs an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods. The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants. This table contains ASVs, and the lengths of merged sequences all fall within the expected range for this V4 amplicon.

Remove Chimers: The core DADA2 method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of the sequence variants after denoising makes identifying chimeras simpler than it is when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences.

The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on factors including experimental procedures and sample complexity. Here chimeras make up about 21% of the merged sequence variants, but when we account for the abundances of those variants we see they account for only about 4% of the merged sequence reads.

Assign Taxon: It is common at this point, especially in 16S/18S/ITS amplicon sequencing, to assign taxonomy to the sequence variants. The DADA2 package provides a native implementation of the naive Bayesian classifier method for this purpose. The Assign Taxonomy function takes as input a set of sequences to be classified and a training set of reference sequences with known taxonomy, and outputs taxonomic assignments.

The DADA2 package also implements a method to make species level assignments based on exact matching between ASVs and sequenced reference strains. Recent analysis suggests that exact matching (or 100% identity) is the only appropriate way to assign species to 16S gene fragments.

Phyloseq: The phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs), especially when there is associated sample data, phylogenetic tree, and/or taxonomic assignment of the OTUs. This package leverages many of the tools available in R for ecology and phylogenetic analysis (vegan, ade4, ape, picante), while also using advanced/flexible graphic systems (ggplot2) to easily produce publication-quality graphics of complex phylogenetic data. Phyloseq uses a specialized system of S4 classes to store all related phylogenetic sequencing data as a single experiment-level object, making it easier to share data and reproduce analyses. In general, phyloseq seeks to facilitate the use of R for efficient interactive and reproducible analysis of OTU-clustered high-throughput phylogenetic sequencing data.

More concretely, phyloseq provides:

  • Import abundance and related data from popular Denoising / OTU-clustering pipelines: (DADA2, UPARSE, QIIME, mothur, BIOM, PyroTagger, RDP, etc.)
  • Convenience analysis wrappers for common analysis tasks
  • 44 supported distance methods (UniFrac, Jensen-Shannon, etc)
  • Ordination –> many supported methods, including constrained methods
  • Microbiome plot functions using ggplot2 for powerful, flexible exploratory analysi
  • Modular, customizable preprocessing functions supporting fully reproducible work.
  • Functions for merging data based on OTU/sample variables, and for supporting manually-imported data.
  • Native R/C, parallelized implementation of UniFrac distance calculations.
  • Multiple testing methods specific to high-throughput amplicon sequencing data.
  • Examples for analysis and graphics using real published data.

With the Data Visualization job, you could view the integrated “Genome Visualizations”, which includes a, 2D PCA plot, 3D PCA plot taxonomic bar plot(showing the average relative abundance of each taxa at various taxonomic levels), and also the relative abundance of taxa to visualize your results and understand the abundance of microbial diversity. 

End: At the end of the pipeline, you would see several outputs, including OTU abundance, the OTU taxonomy and visualization outputs. 

Output Files: Obtained when pipeline processing is complete

After the pipeline has completed its processing, you will obtain a list of output files that could be downloaded to carry out statistical analysis and interpret biological insights. You will also obtain data visualizations in your output files that make sense to understand meaningful patterns or significant results. 

 

Relative Abundance of Taxa

Heatmap

Species abundance is the number of individuals per species, and relative abundance refers to the evenness of distribution of individuals among species in a community.

 

A heat map is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.

Phylogenetic Tree (OTU)

Alpha Diversity Plot

A phylogenetic tree, also known as a phylogeny, is a diagram that depicts the lines of evolutionary descent of different species, organisms, or genes from a common ancestor.

Alpha diversity is the diversity in a single ecosystem or sample. The simplest measure is richness, the number of species (or OTUs) observed in the sample. Other metrics consider the abundances (frequencies) of the OTUs, for example to give lower weight to lower-abundance OTUs.

Alpha Diversity Plot

NMDS Plot

Chao1 estimates the number of species, whereas Shannon estimates the effective number of species.

 

NMDS plots are non-metric, meaning that among other things, they use data that is not required to fit a normal distribution. This is handy for microbial ecologists because the majority of our data has a skewed distribution with a long tail.

2D NMDS Plot

3D NMDS Plot

Taxa Abundance Bar Plot

Relative Abundance of Taxa

Taxa abundance bar plot represents the number of individuals per species.

Relative abundance refers to the evenness of distribution of individuals among species in a community.

To learn more about each section & get a practical hands on experience, get started with “Metagenomics” coursework on the OmicsLogic Learn Portal

Link to the Course: https://learn.omicslogic.com/courses/course/course-4-metagenomics  

For any questions, you can reach out to us at marketing@omicslogic.com or support@pine.bio