Projects
Interpretable Deep-Learning and Ensemble Models for Predicting Multidrug Resistance in Klebsiella pneumoniae
Repository: github.com/NasirNesirli/kleb-amr-project
Summary
A reproducible Snakemake workflow for predicting antimicrobial resistance (AMR) in Klebsiella pneumoniae from genomic data. It compares tree-based ensemble methods with deep learning architectures, using temporal validation and interpretability analysis.
Key Features
- 20-Stage Pipeline: Complete workflow from data acquisition to ensemble evaluation
- Multi-Model Comparison: XGBoost, LightGBM, 1D CNN, Sequence CNN, and DNABERT-2
- Temporal Validation: Models are trained on pre-2023 isolates and tested on 2023-2024 isolates to assess real-world generalization
- Interpretability: SHAP-based feature importance analysis with biological validation
- Four Antibiotics, One per Class: Meropenem (carbapenem), ceftazidime (cephalosporin), ciprofloxacin (fluoroquinolone), and amikacin (aminoglycoside)
- Fully Reproducible: Conda environments and scalable infrastructure (32 vCPU, 128 GB RAM)
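The temporal validation described above can be illustrated with a minimal sketch. It assumes each isolate carries a collection year; the `Isolate` record, its field names, and the example isolates are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Isolate:
    isolate_id: str       # hypothetical identifier
    collection_year: int  # hypothetical metadata field
    resistant: bool       # hypothetical phenotype label


def temporal_split(isolates, cutoff_year=2023, last_test_year=2024):
    """Train on isolates collected before cutoff_year; test on
    isolates collected from cutoff_year through last_test_year."""
    train = [i for i in isolates if i.collection_year < cutoff_year]
    test = [
        i for i in isolates
        if cutoff_year <= i.collection_year <= last_test_year
    ]
    return train, test


# Toy data, invented for illustration only.
isolates = [
    Isolate("kp_001", 2019, True),
    Isolate("kp_002", 2021, False),
    Isolate("kp_003", 2022, True),
    Isolate("kp_004", 2023, False),
    Isolate("kp_005", 2024, True),
]

train, test = temporal_split(isolates)
print([i.isolate_id for i in train])
print([i.isolate_id for i in test])
```

Splitting by date rather than at random keeps every test isolate strictly later than every training isolate, so the evaluation reflects how a model would behave on strains collected after it was trained.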
Technical Stack
- Workflow: Snakemake 7.32+
- Bioinformatics: FastQC, fastp, SPAdes, AMRFinderPlus, Snippy, Kraken2, QUAST
- Machine Learning: XGBoost, LightGBM, scikit-learn
- Deep Learning: PyTorch, Transformers, DNABERT-2
- Interpretability: SHAP
- Languages: Python, Bash
Key Results
Tree-based models (XGBoost, LightGBM) consistently outperformed the deep learning approaches:
- XGBoost: 0.824 ROC-AUC, 0.787 sensitivity, 0.800 specificity
- LightGBM: F1-score of 0.857 for cephalosporin (ceftazidime) prediction (exceeding clinical threshold)
- SHAP analysis identified biologically meaningful resistance determinants including KPC carbapenemase and RND efflux pump components
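For readers less familiar with the metrics reported above, the following sketch shows how sensitivity, specificity, and F1-score are derived from binary predictions. The labels are invented toy values (1 = resistant, 0 = susceptible) and do not reproduce the project's results. ROC-AUC is omitted because it requires predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, and F1 from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # resistant, called resistant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # resistant, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # susceptible, called susceptible
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # susceptible, called resistant

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In an AMR setting, sensitivity measures how many truly resistant isolates are flagged, while specificity measures how many susceptible isolates are correctly cleared, so the two are usually reported together.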