Single Cell RNA-Seq on T-Bioinfo Server

Single-cell RNA-Seq provides transcriptional profiling of thousands of individual cells. This level of analysis enables researchers to understand, at single-cell resolution, which genes are expressed, in what quantities, and how expression differs across thousands of cells within a heterogeneous sample. The analysis consists of methods for dimensionality reduction, clustering and annotation of cells by type. Before these methods can be applied, the data has to be quality-filtered and normalized. Most of the time, samples are then integrated, projected into a smaller number of dimensions and clustered. Clusters represent biologically distinct groups of cells that can be assigned a known cell type based on the expression levels of marker genes.

Now let's look at an example Single-Cell RNA-Seq pipeline on the T-Bioinfo Server: https://server.t-bio.info/pipelinessinglecellrnaanalysisinseurat/demopipelines/demo-seurat10x-breast-cancer and learn about the types of input files that should be uploaded, the parameters chosen to run the pipeline, the processing steps and, finally, what the output files look like.

INPUT FILES:

To run the Single Cell RNA-Seq pipeline, one of the following input types can be selected and uploaded:

  • Digital Gene Expression
  • 10xGenomics

File Types

Input files can be uploaded in one of three ways:

  • Regular Upload
  • SVL Upload
  • Bulk Upload

If we opt for Regular Upload, we should separately upload the “barcodes” file, the “genes” file, the “matrix” file and the “metadata” file.

With the “Bulk Upload” option, we can simply upload the “svl file” for all samples along with the metadata file and proceed with running the pipeline.

Finally, upload the metadata file and submit the files to run the pipeline.

PIPELINE:

The pipeline runs through the following workflow:

Start > QC > SCT Transformation > PCA > Find Clusters > UMAP > Find All Markers > Marker Plots > CellDex for Humans/Mouse > DE between clusters

Let's now look at what each step in the pipeline does.

Start: In some cases bulk RNA sequencing is not enough to answer the research question, especially in immunology, where different cell types have to be studied separately. Single-cell RNA sequencing is well suited to this task.

       Several studies have applied it to cancer, including breast, ovarian and colorectal cancer (Lambrechts, 2020; Kieffer, 2020; Qi, 2021). Single-cell sequencing (scRNA-seq) has also made a big impact on immunology (Chen, 2019).

       The scRNA-seq workflow can be separated into experimental design and data analysis. Experimental design consists of single-cell capture, mRNA capture and reverse transcription, cDNA amplification and library preparation, followed by sequencing. Reads obtained from platforms such as 10x Genomics or Drop-seq, typically sequenced on Illumina instruments, are trimmed and mapped to the reference genome, and the genes are then quantified from the alignment. This results in an expression matrix, where the columns are the cells and the rows are the genes. So, for each cell we have its expression count for every gene that was detected.
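As a minimal sketch of the expression matrix described above, the following Python snippet tallies counted molecules into a genes x cells table. The function name, the barcodes and the input format (one (barcode, gene) pair per counted molecule) are illustrative assumptions, not the output format of any particular aligner or of the T-Bioinfo Server.

```python
from collections import Counter


def build_expression_matrix(records):
    """Tally (cell_barcode, gene) pairs into a genes x cells count matrix.

    `records` is a hypothetical input: one pair per counted molecule.
    Returns (genes, cells, matrix), where matrix[g][c] is the count of
    gene g in cell c (rows are genes, columns are cells).
    """
    counts = Counter(records)
    cells = sorted({cell for cell, _ in counts})
    genes = sorted({gene for _, gene in counts})
    # A Counter returns 0 for pairs that were never observed.
    matrix = [[counts[(cell, gene)] for cell in cells] for gene in genes]
    return genes, cells, matrix


# Toy data: two cells, two genes (barcodes are made up).
records = [
    ("AAAC", "CD3E"), ("AAAC", "CD3E"), ("AAAC", "MS4A1"),
    ("TTTG", "MS4A1"),
]
genes, cells, matrix = build_expression_matrix(records)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Real datasets are far too sparse for a dense table like this, which is why 10x-style input is split into separate barcodes, genes and matrix files.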

Quality Control: This module filters the single-cell expression data to include only true cells of high quality. The metrics commonly used to identify high-quality cells are:

  1. Cells in which barcoding or amplification reactions were not successful will have lower numbers of UMIs and genes detected. Lower numbers of detected genes can hinder downstream analyses such as clustering because the genes that are able to distinguish cell populations may not be adequately measured. Often, these cells are excluded by setting a minimum threshold on the number of genes detected. 

  2. Another aspect unique to droplet-based microfluidic devices is that many droplets will not contain an actual cell, and some will contain multiple cells. Despite the absence of a cell, “empty droplets” may contain low levels of background ambient RNA that was present in the cell suspension. To exclude empty droplets and multi-cell droplets (“doublets”), it is recommended to set thresholds on the minimum and maximum amount of expression (counts) in each cell.

  3. Expression of mitochondrial genes can indicate cells that are stressed due to experimental procedures. These cells can be filtered out by setting a threshold on the percentage of mitochondrial gene expression. However, this parameter has to be used cautiously - different tissues, as well as different organisms, are expected to have different percentages of mitochondrial genes expressed.
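The three metrics above can be sketched in a few lines of Python. This is an illustration only, not the server's implementation: the function names, the toy matrix and all threshold values are hypothetical, and the "MT-" prefix assumes human gene symbols.

```python
def qc_metrics(matrix, genes, mito_prefix="MT-"):
    """Per-cell (n_genes, n_counts, pct_mito) for a genes x cells matrix.

    The "MT-" prefix is an assumption that fits human gene symbols.
    """
    is_mito = [g.upper().startswith(mito_prefix) for g in genes]
    metrics = []
    for column in zip(*matrix):  # one column per cell
        n_counts = sum(column)
        n_genes = sum(1 for x in column if x > 0)
        mito = sum(x for x, m in zip(column, is_mito) if m)
        pct_mito = 100.0 * mito / n_counts if n_counts else 0.0
        metrics.append((n_genes, n_counts, pct_mito))
    return metrics


def qc_filter(matrix, genes, min_genes, min_counts, max_counts, max_pct_mito):
    """Return the column indices of cells that pass all thresholds."""
    keep = []
    for cell, (n_genes, n_counts, pct_mito) in enumerate(qc_metrics(matrix, genes)):
        if (n_genes >= min_genes
                and min_counts <= n_counts <= max_counts
                and pct_mito <= max_pct_mito):
            keep.append(cell)
    return keep


genes = ["MT-CO1", "CD3E", "MS4A1", "LYZ"]
# Columns: a good cell, a near-empty droplet, a stressed cell, a likely doublet.
matrix = [
    [1, 0, 9, 2],
    [5, 1, 2, 40],
    [3, 0, 1, 30],
    [4, 0, 1, 28],
]
# Thresholds are scaled down to fit the toy data; real values are dataset-specific.
kept = qc_filter(matrix, genes, min_genes=3, min_counts=5,
                 max_counts=50, max_pct_mito=20.0)
print(kept)  # [0]
```

Only the first cell survives: the second fails the minimum-gene threshold, the third exceeds the mitochondrial percentage, and the fourth exceeds the maximum count.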

For a detailed review of biases in scRNA-seq, please read: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4758103/

SCT Transformation: Hafemeister and Satija (2019) introduced a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. This procedure removes the need for heuristic steps such as pseudocount addition or log-transformation, and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression.
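To give a feel for what variance-stabilized values look like, here is a deliberately simplified sketch of Pearson residuals under a negative binomial model. It is not the sctransform implementation: the published method fits a regularized regression per gene, whereas this sketch assumes a depth-only expected count and a single fixed dispersion value (theta), both of which are assumptions made for illustration.

```python
import math


def pearson_residuals(matrix, theta=100.0):
    """Simplified Pearson residuals for a genes x cells count matrix.

    Assumptions (not from the sctransform paper): the expected count is
    mu = gene_total * cell_total / grand_total, and the dispersion theta
    is one fixed value shared by all genes.
    """
    gene_totals = [sum(row) for row in matrix]
    cell_totals = [sum(col) for col in zip(*matrix)]
    grand_total = sum(gene_totals)
    residuals = []
    for g, row in enumerate(matrix):
        out = []
        for c, x in enumerate(row):
            mu = gene_totals[g] * cell_totals[c] / grand_total
            variance = mu + mu * mu / theta  # negative binomial variance
            out.append((x - mu) / math.sqrt(variance) if variance > 0 else 0.0)
        residuals.append(out)
    return residuals


# Counts that scale only with sequencing depth carry no biological signal,
# so every residual is (numerically) zero.
depth_only = pearson_residuals([[2, 4], [4, 8]])

# The first gene is over-represented in the first cell: positive residual there.
enriched = pearson_residuals([[10, 0], [5, 5]])
print(enriched[0][0] > 0)
```

The point of the transformation is visible in the first example: differences explained purely by sequencing depth vanish, while genuine over- or under-expression remains as a signed residual.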

PCA: Principal Component Analysis (PCA) is performed on the scaled data. By default, only the previously determined variable features are used as input, but if all features were chosen as an argument for Scale Data, PCA will be performed on all scaled features.
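As a rough illustration of what PCA computes, the sketch below finds the first principal component of a small cells x features table by power iteration on the covariance matrix. Power iteration is a choice made here for brevity and is not how the pipeline computes PCA; the function name and toy data are hypothetical.

```python
import math


def first_principal_component(rows, n_iter=200):
    """First principal component of a cells x features table.

    Returns (unit_vector, scores), where scores[i] is the projection of
    cell i onto the component. Note: the starting vector must not be
    orthogonal to the leading component, which holds for the toy data.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores


# Four cells whose second feature is exactly twice the first:
# all variance lies along the direction (1, 2).
cells_by_feature = [[0, 0], [1, 2], [2, 4], [3, 6]]
component, scores = first_principal_component(cells_by_feature)
print(component, scores)
```

Each cell is then described by its score on the leading components rather than by thousands of gene values, which is what the clustering step operates on.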

Find Clusters: The clustering procedure relies on graph-based clustering approaches to scRNA-seq data (Xu and Su 2015; Levine et al. 2015). Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns - and then attempt to partition this graph into highly interconnected ‘communities’. First, a KNN graph is constructed based on the Euclidean distance in PCA space, and the edge weights between any two cells are refined based on the shared overlap in their local neighborhoods (Jaccard similarity). To cluster the cells, modularity optimization techniques such as the Louvain algorithm (default) are then applied. These techniques iteratively group cells together, with the goal of optimizing the standard modularity function. A resolution parameter sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. Setting this parameter between 0.4 and 1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets.
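The graph construction step can be sketched as follows: build a KNN graph from Euclidean distances, then weight each edge by the Jaccard similarity of the two cells' neighborhoods. This is a minimal illustration, not the pipeline's code; the function names are hypothetical, counting each cell as a member of its own neighborhood is an assumption, and the community detection step (Louvain) is not included.

```python
import math


def knn_indices(points, k):
    """For each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def jaccard_weighted_edges(points, k):
    """KNN edges weighted by the Jaccard overlap of the two neighborhoods.

    Assumption: each cell is counted as part of its own neighborhood.
    """
    nbrs = knn_indices(points, k)
    edges = {}
    for i in range(len(points)):
        for j in nbrs[i]:
            a, b = min(i, j), max(i, j)
            if (a, b) in edges:
                continue
            hood_a = nbrs[a] | {a}
            hood_b = nbrs[b] | {b}
            edges[(a, b)] = len(hood_a & hood_b) / len(hood_a | hood_b)
    return edges


# Two well-separated groups of cells in a 2D "PCA space".
pca_coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = jaccard_weighted_edges(pca_coords, k=2)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

In this toy example every edge stays within a group and no edge links the two groups, so a modularity-based method would have an easy partition to find. Real data has weaker cross-group edges, and the resolution parameter decides how finely such a graph is split.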

Manifold UMAP: Uniform Manifold Approximation and Projection (UMAP), a nonlinear dimensionality reduction technique, helps visualize and explore the dataset. The goal of this class of algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. For an in-depth discussion of the mathematics underlying UMAP, see the arXiv paper (McInnes, Healy and Melville, 2018, arXiv:1802.03426).

Find All Markers: Finds markers (differentially expressed genes) for the cells in each cluster compared to the cells of all other clusters. These genes are found using the non-parametric Wilcoxon Rank Sum test (sometimes also referred to as the Mann-Whitney U test), which is based on the order in which the observations from the two samples fall.
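The rank-based idea behind the test can be sketched for a single gene as follows. This is a minimal illustration under stated assumptions, not the pipeline's implementation: it uses a normal approximation with tie correction and no continuity correction, the function names are hypothetical, and the expression values are made up.

```python
import math


def average_ranks(values):
    """1-based ranks with ties given their average rank; also tie group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def rank_sum_test(group_a, group_b):
    """Return (U statistic for group_a, two-sided p-value).

    The p-value uses a normal approximation (an assumption that is
    rough for very small groups).
    """
    n1, n2 = len(group_a), len(group_b)
    n = n1 + n2
    ranks, tie_sizes = average_ranks(list(group_a) + list(group_b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    sd_u = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    if sd_u == 0:
        return u1, 1.0
    z = (u1 - mean_u) / sd_u
    return u1, math.erfc(abs(z) / math.sqrt(2))


# Expression of one gene in the cluster of interest vs. all other cells.
in_cluster = [5, 6, 7, 8]
other_cells = [1, 2, 3, 4]
u_stat, p_value = rank_sum_test(in_cluster, other_cells)
print(u_stat, p_value)
```

Here every cell in the cluster outranks every other cell, so U reaches its maximum of n1 * n2 = 16. Only the ordering of the values matters, not their magnitudes, which makes the test robust to the skewed distributions typical of single-cell counts.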

Marker plots: Marker plots allow users to visualize (dimplot) the markers chosen after running FindAllMarkers and see their distribution on the UMAP/tSNE graphs (on the right). Violin plots are also generated to show the expression pattern across the clusters (left).

Celldex for Human/Mouse: Provides a collection of reference expression datasets with curated cell type labels. Only available for human and mouse data. For human data the available databases are: BlueprintEncodeData, HumanPrimaryCellAtlasData, MonacoImmuneData, and NovershternHematopoieticData. For mouse data the available databases are ImmGenData and MouseRNAseqData.

OUTPUT FILES:

After the pipeline has completed processing, you will obtain a list of output files that can be downloaded for statistical analysis and biological interpretation. The output files also include data visualizations that help reveal meaningful patterns and significant results.

 

Feature_Plot: In Seurat, FeaturePlot colors single cells on a dimensional reduction plot (such as UMAP) according to a feature, for example the expression level of a gene.

Vln_plot: A violin plot is a hybrid of a box plot and a kernel density plot, which shows peaks in the data. It is used to visualize the distribution of numerical data. Unlike a box plot, which can only show summary statistics, a violin plot depicts both the summary statistics and the density of each variable.

Cluster_1_best_12_genes_vlnplot: Representation of the top 12 differentially expressed genes in a cluster as violin plots.

folder_Dimensionality_step_DimPlot: Graphs the output of a dimensional reduction technique on a 2D scatter plot where each point is a cell, positioned based on the cell embeddings determined by the reduction technique. By default, cells are colored by their identity class (this can be changed with the group.by parameter).

DotPlot_by_group_cluster_1: A dot plot is a simple form of data visualization that consists of data points plotted as dots on a graph with an x- and y-axis, used to graphically depict trends or groupings in the data. In Seurat's DotPlot, the size of each dot encodes the percentage of cells in a group that express a gene, and the color encodes the average expression level.

 

To learn more about each section and get practical hands-on experience, get started with the “Single Cell Transcriptomics” coursework on the OmicsLogic Learn Portal.

Link to the course: https://learn.omicslogic.com/courses/course/course-6-single-cell-transcriptomics

For any questions, you can reach out to us at marketing@omicslogic.com or support@pine.bio