Cancer Gene Expression Microarray Data Analysis

Problem Statement

ICMR wants to analyze several types of cancer (breast, renal, colon, lung, and prostate) that have become a cause for concern in recent years. The goal is to identify the genes most likely responsible for each cancer type. This would enable earlier identification of each type of cancer and thereby help reduce the fatality rate.

Dataset Details

The input dataset contains 802 samples, one for each of 802 people diagnosed with different types of cancer. Each sample contains expression values for more than 20K genes. Each sample is labeled with one of five tumor types: BRCA (breast), KIRC (kidney renal clear cell), COAD (colon), LUAD (lung), or PRAD (prostate).

Tasks

Week 1

Plot the merged dataset as a hierarchically clustered heatmap to check whether the five classes show up as visibly distinct groups.
Apply feature selection algorithms, filter the original dataset down to the selected columns, and save the result as a new DataFrame representing the feature selection data.
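
The two Week 1 tasks can be sketched as follows. This is a minimal illustration, not the project's actual notebook code: the data is synthetic, the gene names and the `gene_const` column are made up, and the choice of a variance filter followed by an ANOVA F-test (`SelectKBest` with `f_classif`) is an assumption, since the brief does not name specific feature selection algorithms.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic stand-in for the merged dataset (assumption): 100 samples x 201 genes.
rng = np.random.default_rng(0)
classes = ["BRCA", "KIRC", "COAD", "LUAD", "PRAD"]
labels = pd.Series(np.repeat(classes, 20), name="Class")
genes = [f"gene_{i}" for i in range(200)]
data = pd.DataFrame(rng.normal(size=(100, 200)), columns=genes)

# Give each class two marker genes with elevated expression (gene_0 .. gene_9).
for k, cancer in enumerate(classes):
    data.loc[labels == cancer, genes[2 * k : 2 * k + 2]] += 4.0
data["gene_const"] = 0.0  # hypothetical gene that is never expressed

# Step 1: drop zero-variance genes, which cannot separate the classes.
variance_filter = VarianceThreshold(threshold=0.0)
variance_filter.fit(data)
varying = data.loc[:, variance_filter.get_support()]

# Step 2: keep the 10 genes with the highest ANOVA F-score against the label.
selector = SelectKBest(score_func=f_classif, k=10)
selector.fit(varying, labels)
selected = varying.loc[:, selector.get_support()]  # feature selection DataFrame

# Hierarchical (Ward) clustering of the samples on the selected genes.
# The leaf order is the row order a clustered heatmap would display.
row_order = leaves_list(linkage(selected.to_numpy(), method="ward"))
ordered_labels = labels.iloc[row_order].to_list()
```

If seaborn is installed, `sns.clustermap(selected, method="ward")` draws the corresponding heatmap with dendrograms. Running the variance filter first avoids undefined F-scores for constant genes.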

Week 2

Each sample has expression values for around 20K genes. However, it may not be necessary to include all 20K genes’ expression values to analyze each cancer type. Therefore, use dimensionality reduction techniques to identify a smaller set of attributes, which will then be used to fit multiclass classification models.
Filter the actual dataset for the columns (or genes) suggested by this approach and save it as a new DataFrame to represent the dimensionality reduction data.
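
A minimal sketch of this step, assuming PCA as the dimensionality reduction technique (the brief does not name one) and using synthetic data. Note that principal components are linear combinations of genes rather than a subset of columns, so the gene filter shown at the end, which ranks genes by their absolute loading on the first component, is only one possible heuristic and is an assumption of this example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix (assumption): 100 samples x 200 genes.
rng = np.random.default_rng(1)
expr = pd.DataFrame(
    rng.normal(size=(100, 200)),
    columns=[f"gene_{i}" for i in range(200)],
)

# Standardize each gene, then keep enough components to explain 95% of the variance.
scaled = StandardScaler().fit_transform(expr)
pca = PCA(n_components=0.95)
components = pca.fit_transform(scaled)

# Dimensionality reduction DataFrame: one row per sample, one column per component.
reduced = pd.DataFrame(
    components,
    index=expr.index,
    columns=[f"PC{i + 1}" for i in range(components.shape[1])],
)

# Heuristic gene filter (assumption): the 10 genes with the largest absolute
# loading on the first principal component.
loadings = pd.Series(np.abs(pca.components_[0]), index=expr.columns)
top_genes = loadings.nlargest(10).index.to_list()
gene_subset = expr[top_genes]
```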

Week 3

Identify groups of genes that behave similarly across samples and identify the distribution of samples corresponding to each cancer type.

First, apply the given clustering technique on all genes to identify:
- Genes whose expression values are similar across all samples
- Genes whose expression values are similar across samples of each cancer type
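
The gene clustering above can be sketched as follows. The brief refers to "the given clustering technique" without naming it, so K-means is used here purely as an assumption, on synthetic data with two planted gene groups. Genes are clustered once on their full per-sample profile and once on their mean expression per cancer type.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in (assumption): 100 samples x 50 genes, 20 samples per class.
rng = np.random.default_rng(2)
classes = ["BRCA", "KIRC", "COAD", "LUAD", "PRAD"]
labels = pd.Series(np.repeat(classes, 20), name="Class")
genes = [f"gene_{i}" for i in range(50)]
expr = pd.DataFrame(rng.normal(size=(100, 50)), columns=genes)

# Plant two gene groups: genes 0-24 are elevated in BRCA, genes 25-49 in PRAD.
expr.loc[labels == "BRCA", genes[:25]] += 5.0
expr.loc[labels == "PRAD", genes[25:]] += 5.0

# (a) Cluster genes by their expression across all samples (rows = genes).
gene_profiles = expr.T.to_numpy()
by_sample = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(gene_profiles)

# (b) Cluster genes by their mean expression within each cancer type.
class_means = expr.groupby(labels).mean().T  # genes x cancer types
by_class = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    class_means.to_numpy()
)

gene_clusters = pd.DataFrame(
    {"all_samples": by_sample, "per_class": by_class}, index=genes
)
```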
Next, apply the given clustering technique on all samples to identify:
- Samples of the same class (cancer type) that also fall in the same cluster
- Samples that belong to the same class (cancer type) but were assigned to a different cluster
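
One way to compare clusters against cancer types is a contingency table. The sketch below again assumes K-means (the brief does not name the technique) and synthetic data. Each cluster is mapped to its majority class, and samples are then split into those whose cluster agrees with their class and those that landed elsewhere.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in (assumption): 100 samples x 20 genes, 20 samples per class.
rng = np.random.default_rng(3)
classes = ["BRCA", "KIRC", "COAD", "LUAD", "PRAD"]
labels = pd.Series(np.repeat(classes, 20), name="Class")
genes = [f"gene_{i}" for i in range(20)]
expr = pd.DataFrame(rng.normal(size=(100, 20)), columns=genes)

# Give each class four marker genes so the classes are separable.
for k, cancer in enumerate(classes):
    expr.loc[labels == cancer, genes[4 * k : 4 * k + 4]] += 6.0

# Cluster the samples into as many clusters as there are cancer types.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
clusters = pd.Series(kmeans.fit_predict(expr.to_numpy()), name="Cluster")

# Rows = cancer type, columns = cluster id, cells = number of samples.
table = pd.crosstab(labels, clusters)

# Map each cluster to its dominant class, then split the samples.
majority_class = table.idxmax(axis=0)
agrees = clusters.map(majority_class) == labels
same_cluster = labels[agrees]    # class and cluster agree
other_cluster = labels[~agrees]  # same class, but assigned to another cluster

print(table)
```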

Week 4

Build a robust classification model for identifying each type of cancer. Try variants of SVM, Random Forest, and Neural Networks on the original, selected, and extracted features, and evaluate them on AUC score.
Write an observation based on your analysis of the best models considered in the previous step.
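
A minimal sketch of the model comparison, on synthetic data. The specific hyperparameters, the single train/test split, and the use of macro-averaged one-vs-rest AUC are assumptions of this example, since the brief only says "AUC score". In the project the same loop would be repeated for the original, selected, and extracted feature sets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in (assumption): 5 classes x 40 samples, 30 genes.
rng = np.random.default_rng(4)
y = np.repeat(np.arange(5), 40)
X = rng.normal(size=(200, 30))
for k in range(5):
    X[y == k, 2 * k : 2 * k + 2] += 3.0  # two marker genes per class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# SVM and the neural network are scale-sensitive, so they get a scaler.
models = {
    "SVM (RBF)": make_pipeline(
        StandardScaler(), SVC(kernel="rbf", probability=True, random_state=0)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    ),
}

# Multiclass AUC: macro average of one-vs-rest AUCs on predicted probabilities.
auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)
    auc_scores[name] = roc_auc_score(y_test, proba, multi_class="ovr")

for name, score in auc_scores.items():
    print(f"{name}: macro one-vs-rest AUC = {score:.3f}")
```

A stratified split keeps all five classes in the test set, which `roc_auc_score` needs for the multiclass case. Cross-validation would give a more stable estimate than a single split.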

Solution

See Project Thesis

Notebooks

Week 1 - EDA, Feature Selection

Week 2 - Dimensionality Reduction

Week 3 - Clustering Genes, Clustering Samples

Week 4 - Model Building Part 1, Model Building Part 2 and Analysis

View the entire project on GitHub