Differential Gene Expression
If the genome is the same in all somatic cells within an organism (with the exception of the above-mentioned lymphocytes), how do the cells become different from one another? If every cell in the body contains the genes for hemoglobin and insulin proteins, how are the hemoglobin proteins made only in the red blood cells, the insulin proteins made only in certain pancreas cells, and neither made in the kidney or nervous system? Based on the embryological evidence for genomic equivalence (and on bacterial models of gene regulation), a consensus emerged in the 1960s that cells differentiate through differential gene expression. The three postulates of differential gene expression are as follows:
- Every cell nucleus contains the complete genome established in the fertilized egg. In molecular terms, the DNAs of all differentiated cells are identical.
- The unused genes in differentiated cells are not destroyed or mutated, and they retain the potential for being expressed.
- Only a small percentage of the genome is expressed in each cell, and a portion of the RNA synthesized in the cell is specific for that cell type. (Reference: https://www.ncbi.nlm.nih.gov/books/NBK10061/)
Differential expression analysis identifies genes whose expression differs between conditions, for example between diseased and healthy tissue. In pharmaceutical and clinical research, differentially expressed genes (DEGs) can help pinpoint candidate biomarkers, therapeutic targets, and gene signatures for diagnostics.
Methods for Differential Gene Expression Analysis
A gene is declared differentially expressed if the observed difference in read counts or expression levels between two experimental conditions is statistically significant. The choice of test therefore depends on the statistical distribution of the data. If the data have been normalized and approximately follow a normal distribution, a t-test (Student's or Welch's) can be used, with a non-parametric alternative such as the Wilcoxon rank-sum test when normality is doubtful. If the data are raw read counts, methods based on the negative binomial (NB) distribution, such as DESeq, DESeq2, and edgeR, are appropriate. In the previous lesson, we discussed how to apply the t-test and how to extend the RNA-seq pipeline into a differential expression pipeline.
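To make the t-test option concrete, here is a minimal sketch of Welch's t statistic for a single gene, written with the Python standard library only. The expression values and variable names are hypothetical, and the sketch stops at the statistic and its degrees of freedom; a p-value would additionally require the t-distribution, which a statistics package provides.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    A positive t means the mean of group_b is higher than that of group_a.
    """
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)  # sample variances (n - 1)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (mean(group_b) - mean(group_a)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical log2-normalized expression values for one gene (not real data)
control = [5.1, 4.9, 5.3, 5.0]
treated = [6.8, 7.1, 6.9, 7.4]

t_stat, dof = welch_t(control, treated)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```

Welch's version is used here because it does not assume equal variances in the two groups. It is only suitable for normalized, roughly normal data; raw read counts call for the negative binomial methods named above.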
Now, let’s learn how to perform differential gene expression analysis on the T-Bioinfo Server. Here we will run the DESeq2 pipeline, followed by enrichment and GSEA analysis, on gene expression data in the form of raw read counts.
Differential Gene Expression on T-Bioinfo Server
To perform differential expression analysis, first select the “Differential Expression” option under the “Data Mining” section on the T-Bioinfo Server (https://server.t-bio.info/), as shown in the figure below. After clicking on the “Differential Expression” option, you will be directed to the “File upload” interface. Click on the Data Upload button to upload your file and group the samples based on the conditions under study.
After the completion of the pipeline, you will get a number of outputs:
- Volcano plot: Visual representation of differentially expressed genes
- Deseq_all.txt: Differentially expressed genes in tabular format
- Enrichment and GSEA plots: You can find details regarding each plot in the next section.
Volcano plot: Visual representation of differentially expressed genes (as shown in the figure below).
- X-axis represents the log2 fold change
- Y-axis represents the P-adjusted value (conventionally plotted on a -log10 scale, so that more significant genes appear higher)
- Red dots on the plot represent the significant genes
- Blue dots represent the non-significant genes
- Red dots with Gene names represent the most significant genes
- Right side of the X-axis (positive log2 fold change) represents genes upregulated in the “Group B” samples
- Left side of the X-axis (negative log2 fold change) represents genes downregulated in the “Group B” samples
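The coloring logic of a volcano plot can be sketched in a few lines of Python. The cutoffs (adjusted p-value below 0.05, absolute log2 fold change of at least 1), the gene names, and the values are assumptions chosen for illustration; the server's actual thresholds are not stated here.

```python
import math


def classify_gene(log2fc, padj, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Label a gene as 'up', 'down', or 'not significant'.

    The cutoffs are illustrative assumptions, not the server's settings.
    """
    if padj is None or padj >= padj_cutoff or abs(log2fc) < lfc_cutoff:
        return "not significant"
    return "up" if log2fc > 0 else "down"


def volcano_point(log2fc, padj):
    """Return the (x, y) position of a gene on a conventional volcano plot."""
    return log2fc, -math.log10(padj)


# Hypothetical results: gene -> (log2 fold change, adjusted p-value)
results = {
    "GENE_A": (2.5, 1e-8),
    "GENE_B": (-1.8, 0.002),
    "GENE_C": (0.3, 0.6),
    "GENE_D": (3.0, 0.2),
}

labels = {gene: classify_gene(lfc, padj) for gene, (lfc, padj) in results.items()}
for gene, (lfc, padj) in results.items():
    x, y = volcano_point(lfc, padj)
    print(f"{gene}: x={x:+.2f}, y={y:.2f}, {labels[gene]}")
```

GENE_D shows why both criteria matter: it has a large fold change but is not statistically significant, so it would be drawn as a non-significant point.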
Gene Enrichment outputs
Besides identifying the significant genes, we can also determine their biological implications, i.e. which pathways, biological processes, or terms these genes are associated with. Based on the enrichment p-value, we can also assess whether the genes are significantly enriched in a particular pathway or GO (Gene Ontology) term.
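One common way to obtain an enrichment p-value for a list of significant genes is a one-sided hypergeometric test. The sketch below assumes that approach; the text above does not state which test the pipeline uses, and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_in_term, n_significant, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_universe    : number of genes tested overall (background)
    n_in_term     : number of background genes annotated to the term
    n_significant : number of significant genes
    n_overlap     : significant genes that are annotated to the term
    """
    upper = min(n_in_term, n_significant)
    total = comb(n_universe, n_significant)
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 1000 background genes, 50 in the term,
# 40 significant genes, 8 of which fall in the term (2 expected by chance).
p = enrichment_p_value(1000, 50, 40, 8)
print(f"enrichment p-value = {p:.2e}")
```

A small p-value indicates that the term contains more significant genes than expected by chance. Because many terms are tested at once, these p-values are normally adjusted for multiple testing before interpretation.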
Now, let’s look at the main outputs of the DESeq2 Enrichment GSEA pipeline. You will get plots (in PDF or JPG format) for GO terms and pathways. Each file shows GO term or pathway enrichment through several plot types, such as a dot plot, UpSet plot, heatmap, and CNet plot. All the plots within one PDF file represent the same information in different ways. Besides the PDF files, you will also find a .CSV file that lists each GO term or pathway ID with complete information, including Description, P-value, P-adjusted value, Enrichment Score, Normalized Enrichment Score (NES), and Genes (the genes from our significant set that were found in that pathway or GO term).
GO enrichment terms
You will find GO term enrichment results in both the PDF and the .CSV file. Let’s look at the plot and understand what it represents. In the dot plot for GO term enrichment (shown in the figure below), each dot corresponds to a GO term listed on the Y-axis. The color of the dot represents the significance score (i.e. the P-adjusted value) for enrichment of genes in the GO term. The size of the dot represents the number of genes in the GO term: the larger the dot, the more genes are present in that GO term. In the CSV file, we can also see the specific genes associated with each term and the exact P-adjusted value.
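P-adjusted values correct the raw p-values for the fact that many GO terms (or genes) are tested at once. A widely used correction is the Benjamini-Hochberg procedure, which is the default in DESeq2; whether the server applies exactly this method to the enrichment results is an assumption. A minimal sketch with hypothetical p-values:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw enrichment p-values for four GO terms
raw = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(raw)
print([round(value, 4) for value in padj])
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps a term with a smaller raw p-value from ending up with a larger adjusted value than a term ranked after it.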
Gene Set Enrichment Analysis
A list of significant genes is hard to interpret on its own. Genes are parts of complex biological processes, and the proteins they encode can be associated with particular locations in the cell or with specific biological processes, such as DNA replication or cytoskeleton growth. Expression of biologically related genes often changes in a coordinated way, so annotating genes in larger groups greatly helps interpretation. Several databases provide insight into biological processes and the genes involved. Gene Set Enrichment Analysis (GSEA), first published by Subramanian et al. in 2005, operates on all genes in the dataset, differentially expressed or not. It aggregates the per-gene statistics across the genes within a gene set, which makes it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Unlike over-representation analysis (ORA), which takes only the names of DE genes as input, GSEA takes a phenotype-related statistic for every gene (DE or not) into account; in our case, it is a statistic based on the log-fold change between two phenotypic states for each gene.
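The core of GSEA is a running-sum enrichment score computed over the ranked gene list. The sketch below follows the weighted running sum described by Subramanian et al. (2005) in simplified form: it omits the permutation-based significance and the normalization that produces the NES. Gene names and statistics are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes : list of (gene, statistic), sorted by statistic, descending
    gene_set     : set of gene names belonging to the pathway or GO term
    """
    hits = [abs(stat) ** weight if gene in gene_set else 0.0
            for gene, stat in ranked_genes]
    n_hits = sum(1 for gene, _ in ranked_genes if gene in gene_set)
    n_miss = len(ranked_genes) - n_hits
    hit_norm = sum(hits)
    if n_hits == 0 or n_miss == 0 or hit_norm == 0:
        raise ValueError("gene set must partially overlap the ranked list")

    running, best = 0.0, 0.0
    for (gene, _), hit in zip(ranked_genes, hits):
        # Step up (weighted) on a gene-set member, step down otherwise.
        running += hit / hit_norm if gene in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


# Hypothetical per-gene statistics based on log-fold change
stats = {"G1": 3.0, "G2": 2.0, "G3": 1.0, "G4": -0.5, "G5": -1.5, "G6": -2.5}
ranked = sorted(stats.items(), key=lambda item: item[1], reverse=True)

es_top = enrichment_score(ranked, {"G1", "G2"})     # set at the top of the list
es_bottom = enrichment_score(ranked, {"G5", "G6"})  # set at the bottom
print(es_top, es_bottom)
```

A positive score means the gene set is concentrated at the top of the ranking (upregulated), and a negative score means it is concentrated at the bottom (downregulated). Every gene contributes, which is why small but coordinated changes can be detected.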
Ridgeplot: The ridgeplot will visualize expression distributions of core enriched genes for GSEA enriched categories. It helps users to interpret up/down-regulated pathways. It is not available for ORA.
Cnet Plot & Upset Plot
CNet Plot: In the CNet plot, we can see the network of genes enriched in the GO terms. The color of a gene node represents its fold change, and the size of a term node represents the number of genes in that GO term or pathway.
UpSet Plot: It is an alternative to the CNet plot for visualizing the complex associations between genes and gene sets. It emphasizes the gene overlap among different gene sets. For GSEA, the top panel shows the distribution of log-fold changes instead of the number of overlapping genes.
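The overlap information behind an UpSet plot can be tabulated directly. This sketch groups genes by the exact combination of gene sets they belong to, which is one common way UpSet bars are defined; the term names and genes are hypothetical.

```python
def exclusive_intersections(gene_sets):
    """Map each combination of set names to the genes found in exactly those sets."""
    membership = {}
    for name, genes in gene_sets.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(name)

    groups = {}
    for gene, names in membership.items():
        groups.setdefault(frozenset(names), set()).add(gene)
    return groups


# Hypothetical enriched terms and their genes
gene_sets = {
    "term_1": {"A", "B", "C"},
    "term_2": {"B", "C", "D"},
    "term_3": {"C", "E"},
}

groups = exclusive_intersections(gene_sets)
for names, genes in sorted(groups.items(), key=lambda kv: (-len(kv[1]), sorted(kv[0]))):
    print(" & ".join(sorted(names)), "->", len(genes), sorted(genes))
```

Each printed line corresponds to one bar of the plot: the combination of terms and the number of genes shared by exactly that combination.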
To learn more about this exciting field and get hands-on practical experience with transcriptome data analysis, you can visit the OmicsLogic Learn portal: Course 5 on Transcriptomics: https://learn.omicslogic.com/courses/course/course-5-transcriptomics.
In this course, you will learn about transcriptomic (RNA-Seq) data analysis, followed by downstream analysis of transcriptomic data, from basic visualization to statistical analysis of differentially expressed genes using the popular DESeq2 package. We will also discuss biological interpretation and annotation of gene sets. Finally, you will learn about machine learning approaches, i.e. supervised and unsupervised machine learning methods. You will also learn how to perform various analyses on data in R and Python.
On the OmicsLogic Learn Portal, experts have also put together an example project that applies transcriptomics technology. In this project, we will try to understand which molecular mechanisms are associated with liver cancer stage classification (comparison of stage 1, stage 2, stage 3, and stage 4) and how we can differentiate the various risk factors responsible for liver cancer development. We will also see how multi-omics data integration helps us understand the association between the clinical phenotypes of samples and the omics data.
Example Project on OmicsLogic Learn Portal : TCGA Liver Cancer
In this lesson, we will first get an overview of the TCGA project and liver cancer. Next, we will look at the Liver Cancer Project and learn about bioinformatics approaches to pan-cancer analysis. We will compare breast cancer and liver cancer to understand whether tissue type (tumor) associated molecular changes play a vital role, and learn about the bioinformatics approaches that can be used to classify tumor samples by stage based on transcriptomics data.
In addition, there are several other research projects based on transcriptomic data analysis that you can go through to understand the practical aspects of transcriptomics.
So what are you waiting for? Start your journey and dive into recent research advances with OmicsLogic training and research programs. For more details, visit: https://learn.omicslogic.com/. For any questions, you can reach out to us at email@example.com or firstname.lastname@example.org