Projects done under Purdue-Simplilearn PGP AI & ML
View the Project on GitHub lookupinthesky/Purdue-Simplilearn-AI-ML
ICMR wants to analyze several types of cancer (breast, renal, colon, lung, and prostate) that have become a cause of worry in recent years. It would like to identify the probable cause of these cancers in terms of the genes responsible for each cancer type. This could enable earlier identification of each type of cancer and thereby reduce the fatality rate.
The input dataset contains 802 samples, one for each of 802 people who have been diagnosed with different types of cancer. Each sample contains expression values for more than 20K genes and belongs to one of five tumor types: BRCA, KIRC, COAD, LUAD, or PRAD.
Plot the merged dataset as a hierarchically clustered heatmap to check whether the five classes appear as visibly distinct groups.
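One way to approach the heatmap step is `seaborn.clustermap`, which clusters rows and columns hierarchically and draws the reordered matrix. The sketch below is only an illustration: it uses a small synthetic matrix in place of the real merged dataset, and names such as `expr`, `labels`, and the `gene_0`-style column names are assumptions, not the project's actual variables.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
classes = ["BRCA", "KIRC", "COAD", "LUAD", "PRAD"]
n_per_class, n_genes = 10, 40  # tiny synthetic stand-in for 802 x 20K

# Synthetic data: each class gets its own mean expression profile plus noise.
labels = pd.Series(np.repeat(classes, n_per_class), name="Class")
profiles = rng.normal(0.0, 3.0, size=(len(classes), n_genes))
data = np.vstack(
    [profiles[i] + rng.normal(size=(n_per_class, n_genes)) for i in range(len(classes))]
)
expr = pd.DataFrame(data, columns=[f"gene_{j}" for j in range(n_genes)])

# Colour strip beside the rows so class membership is visible next to the dendrogram.
palette = dict(zip(classes, sns.color_palette("tab10", len(classes))))
row_colors = labels.map(palette)

grid = sns.clustermap(
    expr,
    method="ward",       # linkage method (an assumption; others are equally valid)
    metric="euclidean",
    z_score=1,           # standardise each gene (column) before clustering
    row_colors=row_colors,
    cmap="vlag",
    figsize=(8, 8),
)

# Row order after clustering; contiguous runs of one class suggest clear separation.
row_order = grid.dendrogram_row.reordered_ind
```

On the full dataset, clustering all 20K gene columns can be slow, so plotting a subset of high-variance genes is a common workaround.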
Apply feature selection algorithms, filter the original dataset down to the selected columns, and save the result as a new DataFrame to represent the feature selection data.
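As a rough sketch of this step, the example below scores each gene with a one-way ANOVA F-test (`SelectKBest` with `f_classif` from scikit-learn) and keeps the top `k` columns. The choice of selector, the value of `k`, and the synthetic data are all assumptions made for illustration; the project may well use a different algorithm.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n_samples, n_genes, k = 100, 200, 10  # hypothetical sizes

# Synthetic stand-in: only the first 10 genes depend on the class label.
y = rng.integers(0, 5, size=n_samples)
data = rng.normal(size=(n_samples, n_genes))
data[:, :10] += 2.0 * y[:, None]
X = pd.DataFrame(data, columns=[f"gene_{j}" for j in range(n_genes)])

# Score every gene against the class label and keep the k best.
selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
selected_genes = X.columns[selector.get_support()]

# Filter the original DataFrame down to the selected columns.
X_selected = X.loc[:, selected_genes].copy()
```

Because `X_selected` keeps the original gene names as column labels, the selected genes remain directly interpretable.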
Each sample has expression values for around 20K genes. However, it may not be necessary to include all 20K genes’ expression values to analyze each cancer type. Therefore, use dimensionality reduction techniques to identify a smaller set of attributes, which will then be used to fit multiclass classification models.
Filter the original dataset for the columns (or genes) suggested by this approach and save the result as a new DataFrame to represent the dimensionality reduction data.
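A minimal sketch of the dimensionality reduction step, assuming PCA as the technique (the task does not name one). Note that PCA produces new components that are linear combinations of genes rather than a subset of the original columns, so the resulting DataFrame has `PC1`, `PC2`, ... columns. The 95% variance threshold and the synthetic data are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the samples x genes expression matrix.
X = pd.DataFrame(
    rng.normal(size=(60, 300)),
    columns=[f"gene_{j}" for j in range(300)],
)

# Standardise genes so that no single gene dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps just enough components to reach that
# fraction of explained variance (0.95 is an assumed threshold).
pca = PCA(n_components=0.95, svd_solver="full")
components = pca.fit_transform(X_scaled)

X_reduced = pd.DataFrame(
    components,
    columns=[f"PC{i + 1}" for i in range(components.shape[1])],
    index=X.index,
)
```

If actual gene columns are needed rather than components, the loadings in `pca.components_` can be inspected to see which genes contribute most to each component.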
Identify groups of genes that behave similarly across samples and identify the distribution of samples corresponding to each cancer type.
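Both clustering tasks can be sketched with k-means, used here as one possible choice: cluster the rows to group samples, and cluster the transposed matrix to group genes. The number of gene clusters (4), the synthetic data, and names like `distribution` and `gene_groups` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
classes = ["BRCA", "KIRC", "COAD", "LUAD", "PRAD"]
n_per_class, n_genes = 20, 30

# Synthetic stand-in: one mean expression profile per tumor type plus noise.
labels = pd.Series(np.repeat(classes, n_per_class), name="tumor_type")
centers = rng.normal(0.0, 5.0, size=(len(classes), n_genes))
data = np.vstack(
    [centers[i] + rng.normal(size=(n_per_class, n_genes)) for i in range(len(classes))]
)
X = pd.DataFrame(data, columns=[f"gene_{j}" for j in range(n_genes)])

# Cluster samples, then cross-tabulate clusters against the known tumor types
# to see how samples of each cancer type are distributed over the clusters.
sample_clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
distribution = pd.crosstab(labels, pd.Series(sample_clusters, name="cluster"))

# Cluster genes: standardise each gene across samples, then treat genes as rows.
X_std = (X - X.mean()) / X.std()
gene_clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_std.T)
gene_groups = pd.Series(gene_clusters, index=X.columns, name="gene_cluster")
```

Printing `distribution` shows, for each tumor type, how many of its samples landed in each cluster; `gene_groups.value_counts()` gives the size of each gene group.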
Build a robust classification model for identifying each type of cancer. Try variants of SVM, Random Forest, and Neural Networks on the original, selected, and extracted features, and evaluate them on AUC score.
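The model comparison could look roughly like the sketch below, which fits an SVM, a Random Forest, and a small neural network and scores each with a one-vs-rest multiclass AUC. The data come from `make_classification` as a synthetic stand-in, and the hyperparameters are untuned assumptions; the same loop could be repeated for the original, selected, and extracted feature sets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic five-class problem standing in for the gene expression features.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # probability=True is required so that SVC exposes predict_proba.
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True, random_state=0)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
    ),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)
    # One-vs-rest AUC, macro-averaged over the five classes.
    auc_scores[name] = roc_auc_score(y_test, proba, multi_class="ovr")
```

A single train/test split is used here for brevity; cross-validation would give a more stable estimate on a dataset of only 802 samples.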
Write an observation based on your analysis of the best models considered in the previous step.
Week 1 - EDA, Feature Selection
Week 2 - Dimensionality Reduction
Week 3 - Clustering Genes, Clustering Samples
Week 4 - Model Building Part 1, Model Building Part 2 and Analysis