Projects

Interpretable Deep-Learning and Ensemble Models for Predicting Multidrug Resistance in Klebsiella pneumoniae

Repository: github.com/NasirNesirli/kleb-amr-project

Summary

A reproducible Snakemake workflow for predicting antimicrobial resistance (AMR) in Klebsiella pneumoniae from genomic data. It compares tree-based ensemble methods with deep learning architectures, evaluates them under temporal validation, and includes interpretability analysis.

Key Features

20-Stage Pipeline: Complete workflow from data acquisition to ensemble evaluation
Multi-Model Comparison: XGBoost, LightGBM, 1D CNN, Sequence CNN, and DNABERT-2
Temporal Validation: Models are trained on pre-2023 data and tested on 2023-2024 data to assess real-world generalization
Interpretability: SHAP-based feature importance analysis with biological validation
Four Antibiotics, One per Class: Meropenem (carbapenem), ceftazidime (cephalosporin), ciprofloxacin (fluoroquinolone), and amikacin (aminoglycoside)
Fully Reproducible: Conda environments; runs on scalable infrastructure (32 vCPU, 128 GB RAM)
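
The temporal validation described above can be illustrated with a minimal sketch. It assumes each sample carries a collection year; the Isolate record, its field names, the temporal_split helper, and the sample IDs are all hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Isolate:
    # Hypothetical record layout, used only for this illustration.
    sample_id: str
    collection_year: int
    resistant: bool


def temporal_split(isolates, cutoff_year=2023, last_test_year=2024):
    """Train on isolates collected before cutoff_year; test on
    isolates collected from cutoff_year through last_test_year."""
    train = [i for i in isolates if i.collection_year < cutoff_year]
    test = [
        i for i in isolates
        if cutoff_year <= i.collection_year <= last_test_year
    ]
    return train, test


# Made-up isolates, not project data.
isolates = [
    Isolate("KP001", 2019, True),
    Isolate("KP002", 2021, False),
    Isolate("KP003", 2022, True),
    Isolate("KP004", 2023, False),
    Isolate("KP005", 2024, True),
]

train, test = temporal_split(isolates)
print([i.sample_id for i in train])  # isolates collected before 2023
print([i.sample_id for i in test])   # isolates collected in 2023-2024
```

Splitting by year rather than at random keeps every test sample later than every training sample, so the evaluation resembles deploying a model on future data.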

Technical Stack

Workflow: Snakemake 7.32+
Bioinformatics: FastQC, fastp, SPAdes, AMRFinderPlus, Snippy, Kraken2, QUAST
Machine Learning: XGBoost, LightGBM, scikit-learn
Deep Learning: PyTorch, Transformers, DNABERT-2
Interpretability: SHAP
Languages: Python, Bash

Key Results

Tree-based models (XGBoost, LightGBM) consistently outperformed the deep learning approaches:

XGBoost: 0.824 ROC-AUC, 0.787 sensitivity, 0.800 specificity
LightGBM: 0.857 F1-score for cephalosporin prediction (exceeding the clinical threshold)
SHAP analysis identified biologically meaningful resistance determinants including KPC carbapenemase and RND efflux pump components
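
For readers unfamiliar with the metrics reported above, the following sketch shows how sensitivity, specificity, F1-score, and ROC-AUC are defined for a binary resistant/susceptible task. It uses plain Python and made-up labels and scores; it is not the project's evaluation code, the function names are hypothetical, and the toy numbers are unrelated to the results listed here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), treating 1 as resistant and 0 as susceptible."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    # Fraction of truly resistant samples that were called resistant.
    return tp / (tp + fn)


def specificity(tn, fp):
    # Fraction of truly susceptible samples that were called susceptible.
    return tn / (tn + fp)


def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn)


def roc_auc(y_true, scores):
    """Probability that a random resistant sample scores higher than a
    random susceptible one; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("F1-score:", f1_score(tp, fp, fn))
print("ROC-AUC:", roc_auc(y_true, scores))
```

Sensitivity, specificity, and F1 depend on the chosen decision threshold, whereas ROC-AUC is computed from the raw scores and summarizes ranking quality across all thresholds.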