Bioinformatics for Cancer Genomics

What Is Cancer Genomics?

Advances in next-generation sequencing (NGS) technologies have revolutionized the way we understand the role of DNA in disease. Specialized research leveraging such data in oncology gave rise to a whole field of study called cancer genomics. Cancer genomics is an interdisciplinary field of biology and medicine that focuses on the genomic drivers of cancer onset and progression, as well as on ways to diagnose and treat cancer at the molecular level.

 

What is Genomics?

Genomics is the study of the structure, function, evolution, mapping, and editing of genomes. It is easy to confuse the term with two similar and widely used terms: “genome” and “genetics”. A genome is the complete set of an organism’s genetic material. Genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and their influence on the organism. Genetics, in contrast, focuses on sequencing and studying specific genes and their roles in inherited traits.

DNA

Image Source - https://cdn.pixabay.com/photo/2016/11/09/15/27/dna-1811955_960_720.jpg 


How Does Bioinformatics Play A Role In Cancer Genomics?

Cancer research now generates large amounts of data through high-throughput technologies. However, the sheer volume of these data makes it difficult for researchers to mine them for answers to complex biological questions. Bioinformatics addresses this problem by using advanced mathematical and computing technologies to store, analyze, integrate, access, and visualize large volumes of complex biological data. This enables researchers and scientists to understand the genomic underpinnings of a healthy cell and how alterations to them result in cancers.

In recent years, several studies have reported that germline copy number variations are associated with an increased risk of developing cancers, including neuroblastoma, colorectal cancer and breast cancer. These variations, which can range in size from the nucleotide level to an entire chromosome, act by altering gene expression levels. Detecting copy number variants in tumor genomes with computational tools and algorithms is one good example of how bioinformatics is applied in cancer genomics. The sections below discuss where to collect data on tumor genomes and how to build a copy number variant detection pipeline on the T-BioInfo server. Before building the pipeline, it helps to know a few key aspects of the T-BioInfo server.

 

Introducing The Tauber Bioinformatics Platform

The Tauber Bioinformatics Platform, known as T-BioInfo, is a cloud-based or locally hosted suite of Big Data analysis tools. The platform provides user-friendly and intuitive access to multiple standard and custom data analysis methods for processing various omics datasets, such as Genomics, Transcriptomics, Epigenomics, Metagenomics, Mass-Spectrometry, Metabolomics and Proteomics, as well as Structural Biology. The platform also provides analysis and integration capabilities for heterogeneous datasets using supervised and unsupervised machine learning.

 

The goal of T-BioInfo and Pine Biotech is to allow everyone access to a flexible, intuitive environment where big data is meaningful and is useful for decision-making. The platform is designed to eliminate dependency on bioinformaticians and streamline the way big data is collected, analyzed and interpreted. To know more, click on the link provided: https://server.t-bio.info/     

 

T-Bioinfo platform



The T-BioInfo platform provides several genomic/epigenetic detection modules including ChIP-seq/chip: Analysis of Chromatin Precipitation Data; Bisulfite DNA Methylation; Mutation Variant: Parallel Analysis of Mutation Variant Data; Copy Number Variation Segmentation; and finally Denovo Genome Assembly. For the purpose of this activity we are going to use the Copy Number Variation Segmentation Pipeline.  

 

Activity - Building A Copy Number Variant Detection Pipeline For Tumor Genomes On T-BioInfo Platform

Copy Number Variation (CNV) is a phenomenon in which sections of the genome are repeated and the number of repeats in the genome varies between individuals in the human population.  In this activity, you are going to learn how to identify copy number variations from Whole Exome Sequencing data collected from two individuals with primary breast ductal carcinoma. For each individual you will use data generated from the tumor genome (cancer cells) and their germline DNA (peripheral blood cells). 

 

Dataset For Copy Number Variant Detection Pipeline

The dataset you are going to use in this activity is a part of the study conducted by Gazdar AF, Kurvari V, Virmani A, et al. that aimed at characterization of paired tumor and non-tumor cell lines established from patients with breast cancer. Next Generation Sequencing (NGS) data were collected from two individuals with primary breast ductal carcinoma. From each individual, cell lines were generated from cancerous (taken from the tumor) and non-cancerous (taken from the blood) cell samples. To identify the variants associated with the cancer, Whole Exome Sequencing was performed on the cell lines. 

 

Information on the characteristics and culture methods of the cell lines you are going to use is available from the American Type Culture Collection (ATCC) website. The tumor and normal cell lines of the pair are listed below:

  1. HCC1143 (Basal breast cancer, SRR925765) 
  2. HCC1143BL (Matched normal – B lymphoblast, SRR925766)

Here are a few key points to keep in mind. The HCC codes are identifiers for the cell lines in the ATCC database. The SRR codes are identifiers for the sequence data from the respective cell lines. Sequencing was done using an Illumina Genome Analyzer. The paired-end read data (FASTQ files) are available in the SRA database at the following locations:

 

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR925/SRR925765/SRR925765_2.fastq.gz

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR925/SRR925765/SRR925765_1.fastq.gz

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR925/SRR925766/SRR925766_2.fastq.gz

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR925/SRR925766/SRR925766_1.fastq.gz

 

The _1 and _2 suffixes at the end of each file name identify the two mates of the paired-end reads.
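The four locations above follow a regular pattern, which can be sketched in a few lines of Python. This is only an illustration: the function name `ena_fastq_urls` is hypothetical, and the directory layout is inferred from the four URLs listed above, so it is assumed to hold only for 9-character run accessions such as SRR925765.

```python
def ena_fastq_urls(accession):
    """Build the two paired-end FASTQ URLs for a 9-character run accession.

    Assumption: the layout is inferred from the URLs listed in this article
    (first six characters of the accession, then the full accession).
    """
    if len(accession) != 9:
        raise ValueError("this sketch only handles 9-character accessions")
    prefix = accession[:6]  # e.g. "SRR925"
    base = f"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{prefix}/{accession}"
    # _1 and _2 are the two mates of each read pair
    return [f"{base}/{accession}_{mate}.fastq.gz" for mate in (1, 2)]


# Tumor (SRR925765) and matched normal (SRR925766) runs from this activity
for run in ("SRR925765", "SRR925766"):
    for url in ena_fastq_urls(run):
        print(url)
```

Running the loop prints the same four locations as listed above, with the _1 file before the _2 file for each run.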

For the purpose of this activity, the data have already been stored on the T-BioInfo server, so you do not need to download them onto your computer.

 

Building The Copy Number Variant Detection Pipeline

To get started, first you should sign up and create an account to have access to the T-BioInfo platform. To create an account, sign up using the link provided: https://server.t-bio.info/

Note that signing up for the T-BioInfo platform for free will only give you access to run the demo pipeline where the datasets are already uploaded. To know more on how to run your own pipelines and analyze datasets using the T-BioInfo platform, kindly contact marketing@omicslogic.com

If you are running the demo pipeline, you can skip to step 5 to start building the copy number variation detection pipeline. If you have access to run your own pipelines, follow the steps below.

  1. Select the “CNV segmentation” under the “Genomics/Epigenetics” field on the T-BioInfo platform. Then, you will be redirected to the page where the tools that are required to build the CNV detection pipeline are present.

 

CNV Segmentation

Image depicting CNV segmentation 

 

  2. Next, the preliminary data information needs to be indicated before continuing to the algorithmic part of the pipeline.
  • Organism - The cell lines are human cell lines and thus you will choose HomoSapiens_GRCh37. 
  • Type of Reference Genome - This is an annotated reference genome, so you will use the option “ModelGenomeGTF”. 
  • Format of NGS data - Our input files are “fastQ”. 
  • Single or pair end reads - These are Paired End (PE) reads.

 

Preliminary data

Image depicting Preliminary Data 

 

  3. Now, you can upload your files. The file paths are specified using the ‘SVL’ link option. Paste the following file paths into the two file link fields.
  • /export-data/sciservice/t22/SRR925765.sra – cancer
  • /export-data/sciservice/t22/SRR925766.sra – normal

 

Dataset File Upload

Image depicting File Upload

 

  4. The next step is to define the groups. When you click on group selection, you will see an empty box labeled “Sample”. Select the ‘+’ next to the label to create a Control group. Now drag and drop the samples into the desired groups. 

Group Selection

Image depicting Group Selection

 

5. Once grouping is done, you can proceed with the development of your pipeline. When you click Start, you will see many options highlighted. The first step of the analysis is PCR Clean, which removes reads that originated from sequencing PCR-duplicated molecules.

6. The reads are then aligned to the reference genome using the Bowtie2 algorithm.

7. To identify differentially covered genomic loci (putative CNV regions), you will use the BinSCNV and Control-FREEC algorithms.

8. Click on uTable to unify the results.

9. Finally, click on the End option, enter a name for the pipeline and click on Run Pipeline.

10. Once the pipeline is complete, you can download and view the output files.
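To build some intuition for what "differentially covered genomic loci" means, here is a minimal Python sketch that compares read counts per genomic bin between a tumor sample and its matched normal. It is not the BinSCNV or Control-FREEC algorithm; the function names, the pseudocount, the 0.4 threshold and the read counts are all assumptions made up for illustration.

```python
import math


def log2_ratios(tumor_counts, normal_counts, pseudocount=1.0):
    """Per-bin log2 ratio of tumor vs. normal read counts.

    Counts are scaled by each sample's total so that differences in
    sequencing depth do not look like copy number changes.
    """
    if len(tumor_counts) != len(normal_counts):
        raise ValueError("both samples must use the same bins")
    tumor_total = sum(tumor_counts)
    normal_total = sum(normal_counts)
    if tumor_total == 0 or normal_total == 0:
        raise ValueError("each sample needs at least one read")
    ratios = []
    for t, n in zip(tumor_counts, normal_counts):
        # The pseudocount avoids division by zero in empty bins
        t_norm = (t + pseudocount) / tumor_total
        n_norm = (n + pseudocount) / normal_total
        ratios.append(math.log2(t_norm / n_norm))
    return ratios


def call_bins(ratios, threshold=0.4):
    """Label each bin as gain, loss or neutral (threshold is an arbitrary choice)."""
    calls = []
    for r in ratios:
        if r > threshold:
            calls.append("gain")
        elif r < -threshold:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


# Invented read counts for five consecutive bins
tumor = [100, 100, 200, 50, 50]
normal = [100, 100, 100, 100, 100]
print(call_bins(log2_ratios(tumor, normal)))
```

Real CNV callers also correct for biases such as GC content and merge neighboring bins into segments, which this sketch deliberately leaves out.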

 

Pipeline Graph

Image depicting Copy Number Variant Detection Pipeline Graph

 

Copy Number Variation Detection Pipeline Output

Once the pipeline has finished running, you can examine the results in the Output Files section. Among the output files, the one you will use to analyze the results has a name ending in “.bam_CNVs.p.value.txt”. This file contains the information on the identified copy number variants.

 

Output files

Image depicting Copy Number Variation Detection Pipeline Output

Bed Format File

Image depicting “.bam_CNVs.p.value.txt”

 

The “.bam_CNVs.p.value.txt” file is a BED-format file that contains the following information: the genomic positions of the identified loci, the copy number estimated in the sample (compared to control), whether the change is a gain or a loss, and p-values from two independent statistical tests: Wilcoxon Rank Sum and Kolmogorov-Smirnov.
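If you prefer to inspect the table programmatically, the following Python sketch keeps only the loci for which both tests are significant. The column names used here ("status", "WilcoxonRankSumTestPvalue", "KolmogorovSmirnovPvalue" and so on) are assumptions and should be checked against the header of your own file; the example rows are invented and are not results from this dataset.

```python
import csv
import io


def significant_cnvs(lines, alpha=0.05,
                     wilcoxon_col="WilcoxonRankSumTestPvalue",
                     ks_col="KolmogorovSmirnovPvalue"):
    """Return rows whose two p-values are both below alpha.

    `lines` is any iterable of tab-delimited text lines with a header row,
    for example an open file handle. Column names are assumptions.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    kept = []
    for row in reader:
        try:
            wilcoxon_p = float(row[wilcoxon_col])
            ks_p = float(row[ks_col])
        except (TypeError, ValueError):
            # Skip rows with missing or non-numeric p-values (e.g. "NA")
            continue
        if wilcoxon_p < alpha and ks_p < alpha:
            kept.append(row)
    return kept


# Invented example table, for illustration only
example_table = io.StringIO(
    "chr\tstart\tend\tcopy number\tstatus\t"
    "WilcoxonRankSumTestPvalue\tKolmogorovSmirnovPvalue\n"
    "1\t1000\t2000\t6\tgain\t0.001\t0.002\n"
    "2\t3000\t4000\t1\tloss\t0.2\t0.03\n"
    "3\t5000\t6000\t10\tgain\tNA\tNA\n"
)

for row in significant_cnvs(example_table):
    print(row["chr"], row["start"], row["end"], row["status"])
```

To use it on a real output file, pass an open file handle instead of the in-memory example, e.g. `significant_cnvs(open(path, newline=""))`, where `path` points to your downloaded file.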

The resulting file can be difficult to navigate. Therefore, you can use external software such as Integrative Genomics Viewer (IGV)  to visualize your data. IGV is a freely available visualization tool for large-scale genomic data. To perform an analysis, IGV must be installed locally on your computer. You can find more on IGV, such as installation instructions and documentation here: https://software.broadinstitute.org/software/igv/.

 

Visualizing Pipeline Output Using Integrative Genomics Viewer

Before loading your file into IGV, remove its first row (the header line with the column names). Then load the hg19 assembly in IGV (the equivalent of the GRCh37 reference that you used for alignment), and load your output file by clicking on “File” and then “Load from Files”. IGV even provides an option to examine specific gene groups that participate together in various pathways.
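Removing the header row can be done in any text editor, but a short Python sketch is shown here for convenience. The function names and the file names in the usage note are hypothetical.

```python
def drop_header(lines):
    """Return all lines except the first one (the header with column names)."""
    iterator = iter(lines)
    next(iterator, None)  # discard the header; does nothing if input is empty
    return list(iterator)


def strip_header_file(src_path, dst_path):
    """Copy src_path to dst_path without its first line."""
    with open(src_path) as src:
        body = drop_header(src)
    with open(dst_path, "w") as dst:
        dst.writelines(body)
```

For example, `strip_header_file("sample.bam_CNVs.p.value.txt", "sample_no_header.txt")` would write a copy of the table without its header line, leaving the original file untouched.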

 

Integrative Genomics Viewer

Image depicting visualization of “.bam_CNVs.p.value.txt” using IGV

 

In IGV, if you click on “Regions” and then “Gene Lists”, a new window appears in which you can choose from pre-selected gene groups or make your own gene group, so that you can view these genes in parallel. For this activity, choose the “p53 signaling pathway” from the “Example gene lists”.

 

The p53 signaling pathway is central to tumor suppression. p53 is a tumor suppressor protein encoded in humans by the TP53 gene. It is crucial in multicellular organisms because it regulates the cell cycle and helps prevent cancer. p53 has many mechanisms of anticancer function and plays a major role in the inhibition of angiogenesis. Moreover, TP53 is the most frequently altered gene in human cancers. The image above shows the copy number variations in genes associated with the p53 signaling pathway: TP53 has 6 copies, MDM2 has 10 copies, MDM4 has 5 copies, and CDKN2A shows a loss of copy number.

Similarly, Nguyen B, et al. identified a subset of prostate cancers with distinct genomic and clinical characteristics based on a pan-cancer analysis of CDK12 alterations. The dataset is available here - http://www.cbioportal.org/study/summary?id=prad_cdk12_mskcc_2020. Learners are encouraged to run the Copy Number Variation Detection pipeline on this dataset. Taken together, this evidence suggests that further research into copy number variations has great potential to unveil clues for the prevention, diagnosis and treatment of cancer.

The activity above is discussed in greater detail in Course 3: Genomics, which is available on the OmicsLogic Training Platform. To learn more about this course, read on.

 

Genomics Coursework

Image Source - https://learn.omicslogic.com/courses/course/course-3-genomics 

 

Overview Of Course 3: Genomics

During my undergraduate days, I struggled a lot in grasping concepts in genomics. However, as a part of my project it was important that I cover topics in genomics to read and analyze research papers related to my area of interest. 

I started with Lesson 1 - DNA Structure & Variants of Course 3: Genomics. This lesson helped me gain a basic understanding of DNA molecules and covered DNA structure, organization and variants. Next, I took Lesson 2 - Working with DNA Sequences in R, which taught me how to align DNA sequences with various alignment techniques and interpret their similarities using the R programming language. Then, I moved on to Lesson 3 - Introduction to Genomics. The module on genomic variations in this lesson is very important because the concept is used throughout the later lessons.

Lesson 4 - DNA Replication and Reverse Complements in R explains in detail the steps involved in DNA replication in eukaryotes. It also taught me how to use the R programming language to read a sequence and create its reverse complement. Lesson 5 - Genomic Variation in NGS Data: Practical gave me the opportunity to perform somatic variant calling analysis on real-life primary ductal carcinoma data. Lesson 2 teaches you how to align DNA sequences, but how do we know how closely related the genomes are? That is where Lesson 6 - Phylogenetic Analysis comes into play. It taught me how to compare genome sequences phylogenetically, using SARS-CoV-2 as an example!

Lesson 7: Comparing DNA sequences using Phylogenetic Analysis teaches you how to visualize the relationship between samples based on their sequences using the Neighbor-Joining algorithm and H-clustering. If you have ever performed multiple sequence alignment, you know how difficult it is to navigate the resulting output file. Lesson 8: Multiple Sequence Alignment - visualizing and filtering alignment teaches you how to visualize and filter the table and, finally, create a report on important variants at specific positions of the reference genome. Lesson 9: Advanced Genomics Overview expands your knowledge of coverage and its significance in single nucleotide variant analysis, as well as of the various mutation data file formats.

 

The significance of performing mutability analysis to understand variants is explained in detail in Lesson 10: Mutability Analysis & Interpretation, using the example of malaria infection caused by Plasmodium falciparum in Navrongo in the years 2009 and 2015. Further, Lesson 11: Differential Mutation Analysis explores which chromosomes or genes have variants that differ between the two groups of samples, 2009 and 2015. This brings us to the last lesson of the course, Lesson 12: Copy Number Variation Analysis.

To learn more about how to visualize and interpret the output of the copy number variant detection pipeline, I highly recommend that learners go through Course 3: Genomics - https://learn.omicslogic.com/courses/course/course-3-genomics  

 

We also have an upcoming Omics Research Symposium 2022 on March 31 & April 01, 2022, where Dr. Gabriel Golzcer-Gatti, Data Scientist at CAMP4 Therapeutics, Cambridge, USA, will talk about how NR4A1 regulates expression and suppresses replication stress in cancer. Interested in participating in our research symposium? Click on the link provided to register for free: https://edu.omicslogic.com/omics-research-symposium-2022

If you are interested in the applications of bioinformatics for cancer genomics, we have an upcoming online workshop on "Detailed Analysis of Cancer Mutations using Next-Generation" on March 22, 2022, at 10 AM CST. The topics covered include tools for Genomic Data Analysis to interrogate genomic variations for cancer studies and to perform somatic and germline variant detection, driver and passenger mutations, copy number variation as well as tumor subclonal deconvolution. Participants will have opportunities to present and discuss case studies and interact with the Omics Logic Experts. To know more, kindly visit the page - https://edu.omicslogic.com/omics-research-symposium-workshop 

 

Have no previous knowledge of bioinformatics? Worry not! If you are a beginner to bioinformatics like I was, I recommend you to get started with the following two courses available on OmicsLogic Training Platform:

To know more about the various courses and programs offered by OmicsLogic, click on the link provided here - https://learn.omicslogic.com/ or write to us at marketing@pine.bio. Happy learning!

 

References

  1. Berger, M. F., & Mardis, E. R. (2018). The emerging clinical relevance of genomics in cancer medicine. Nature reviews. Clinical oncology, 15(6), 353–365. https://doi.org/10.1038/s41571-018-0002-6 

  2. Bioinformatics, Big Data, and Cancer. Published by the National Cancer Institute. https://www.cancer.gov/research/infrastructure/bioinformatics 

  3. Kasaian, K., Li, Y. Y., & Jones, S. J. M. (2014). Chapter 9 - Bioinformatics for Cancer Genomics. In G. Dellaire, J. N. Berman, & R. J. Arceci (Eds.), Cancer Genomics (pp. 133–152). Academic Press. ISBN 9780123969675. https://doi.org/10.1016/B978-0-12-396967-5.00009-8

  4. Gazdar, A. F., Kurvari, V., Virmani, A., Gollahon, L., Sakaguchi, M., Westerfield, M., Kodagoda, D., Stasny, V., Cunningham, H. T., Wistuba, I. I., Tomlinson, G., Tonk, V., Ashfaq, R., Leitch, A. M., Minna, J. D., & Shay, J. W. (1998). Characterization of paired tumor and non-tumor cell lines established from patients with breast cancer. International journal of cancer, 78(6), 766–774. https://doi.org/10.1002/(sici)1097-0215(19981209)78:6<766::aid-ijc15>3.0.co;2-l 

  5. Nguyen, B., Mota, J. M., Nandakumar, S., Stopsack, K. H., Weg, E., Rathkopf, D., Morris, M. J., Scher, H. I., Kantoff, P. W., Gopalan, A., Zamarin, D., Solit, D. B., Schultz, N., & Abida, W. (2020). Pan-cancer Analysis of CDK12 Alterations Identifies a Subset of Prostate Cancers with Distinct Genomic and Clinical Characteristics. European urology, 78(5), 671–679. https://doi.org/10.1016/j.eururo.2020.03.024