This project used machine learning to classify host organisms from gene expression data, comparing several models to identify the genes that matter most for phenotype prediction. Working as part of a team, I led the technical setup and analysis workflow, applying bioinformatics methods and statistical tools to determine which genes significantly affect the predictions.
Project Highlights
Objective: To develop a predictive model for distinguishing phenotypes in specific host organisms using gene expression data.
Technologies Used: Python, R
Machine Learning Libraries: Scikit-learn for implementing Logistic Regression, Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), and Quadratic Discriminant Analysis (QDA).
Data Manipulation: Pandas and NumPy
Data Visualization: Matplotlib and Seaborn to create performance heatmaps, ROC curves, and confusion matrices, helping analyze model accuracy and error distribution.
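As an illustration of the evaluation outputs listed above, here is a minimal sketch of how a confusion matrix and ROC curve can be computed with scikit-learn. It is not the project's code: the synthetic data from make_classification, the 70/30 split, and the logistic regression classifier are stand-in assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a samples-by-genes expression matrix.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))

# ROC points are computed from the predicted probability of the positive class.
positive_scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, positive_scores)
auc = roc_auc_score(y_test, positive_scores)

print(cm)
print(f"AUC: {auc:.3f}")
```

The cm array can then be passed to seaborn.heatmap, and fpr and tpr plotted against each other with Matplotlib, to produce the figures.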
Key Objectives and Findings
Algorithm Selection and Performance:
Each algorithm was selected for its fit to the dataset’s characteristics: logistic regression for straightforward interpretability, LDA for linear decision boundaries, QDA for quadratic (nonlinear) boundaries, and SVM for high-dimensional data.
Among the algorithms, SVM performed best, reaching 100% accuracy (0% error rate) in distinguishing between host types and identifying four key genes (O05496, Q9DCM0, Q27960, and C0H419) with significant predictive power.
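The comparison described above can be sketched roughly as follows. This is an illustration, not the project's actual code: the synthetic data, the 5-fold stratified cross-validation, the linear SVM kernel, and the QDA reg_param value are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for a samples-by-genes expression matrix.
X, y = make_classification(
    n_samples=90, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
    # reg_param shrinks the per-class covariance estimates (assumed value).
    "qda": QuadraticDiscriminantAnalysis(reg_param=0.1),
    "svm_linear": SVC(kernel="linear"),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    # Scaling lives inside the pipeline so it is refit on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    fold_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, accuracy in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {accuracy:.3f}")
```

Scoring every model on the same folds keeps the accuracies directly comparable.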
Phenotypes and Gene Significance:
Key genes like O05496 (important for carbohydrate metabolism) and Q9DCM0 (involved in energy management and detoxification) played crucial roles in phenotype prediction.
Different algorithms highlighted distinct genes, each potentially contributing to the host's biological functions, indicating a complex, multi-gene interaction in phenotype determination.
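One way to see how different models highlight different genes is to rank features by the absolute size of each model's coefficients. The sketch below assumes synthetic data, placeholder gene names (gene_0, gene_1, ...), and a hypothetical helper top_genes; it does not reproduce the project's gene list.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data and placeholder gene identifiers.
X, y = make_classification(
    n_samples=90, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
gene_ids = [f"gene_{i}" for i in range(X.shape[1])]

# Standardize so coefficient magnitudes are comparable across genes.
X_scaled = StandardScaler().fit_transform(X)


def top_genes(model, k=4):
    """Fit the model and return the k genes with the largest |coefficient|."""
    model.fit(X_scaled, y)
    weights = np.abs(model.coef_).ravel()
    order = np.argsort(weights)[::-1][:k]
    return [gene_ids[i] for i in order]


top = {
    "logistic_regression": top_genes(LogisticRegression(max_iter=1000)),
    "lda": top_genes(LinearDiscriminantAnalysis()),
}
shared = set(top["logistic_regression"]) & set(top["lda"])

for name, genes in top.items():
    print(name, genes)
print("selected by both:", sorted(shared))
```

Genes that appear in several models' top lists are more plausible markers than genes picked out by one model alone.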
Algorithmic Insights:
Logistic Regression: Provided an interpretable model with a 78% accuracy rate, identifying two genes with significant coefficients.
LDA: Achieved a 77.78% accuracy rate, marking genes with high coefficients as strong predictors.
SVM with Recursive Feature Elimination: Excelled with a perfect classification rate, underscoring the algorithm’s robustness in high-dimensional data.
Partial Least Squares (PLS) and QDA: While not the top performers, they offered insights into dimensionality reduction and biological relevance, lending further support to the identified gene markers.
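The SVM with Recursive Feature Elimination step could look roughly like the sketch below, which uses scikit-learn's RFE around a linear-kernel SVC. The synthetic data, the placeholder gene names, the choice of four retained features, and the 5-fold cross-validation are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in data and placeholder gene identifiers.
X, y = make_classification(
    n_samples=90, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
gene_ids = np.array([f"gene_{i}" for i in range(X.shape[1])])

pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        # RFE repeatedly drops the feature with the smallest |coefficient|.
        ("rfe", RFE(SVC(kernel="linear"), n_features_to_select=4, step=1)),
        ("svm", SVC(kernel="linear")),
    ]
)

# Feature selection is refit inside each fold, so held-out samples
# never influence which genes are kept.
cv_accuracy = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy").mean()

# Fit on all samples to report the final selected genes.
pipeline.fit(X, y)
selected_genes = gene_ids[pipeline.named_steps["rfe"].support_]

print(f"cross-validated accuracy: {cv_accuracy:.3f}")
print("selected genes:", list(selected_genes))
```

Keeping the selector inside the pipeline matters when there are far more genes than samples: selecting features on the full dataset before cross-validating tends to inflate the reported accuracy.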
Outcomes:
This project demonstrated how machine learning can be applied in bioinformatics and drew on my skills in model selection, feature engineering, and data visualization. The experience taught me how to optimize predictive accuracy while maintaining interpretability, which is especially valuable for high-dimensional genetic data. Applying SVM alongside a comparative analysis of the other algorithms showed which gene expression patterns best separate the phenotypes and strengthened my technical expertise in phenotype classification.



