
Project Overview 🧾

This project applies two supervised machine learning methods, k-Nearest Neighbors (kNN) and Support Vector Machines (SVM), to predict the risk of death in ICU patients. It covers a complete pipeline: preprocessing, targeted feature engineering, hyperparameter optimization, and interpretation of results. The main objective is to model clinical risk from raw hospital records and turn those records into actionable insights.

Data Preparation and Feature Engineering 🧪🧹

The preprocessing phase addressed outliers, missing values, and the high cardinality of categorical variables such as ethnicity.
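One common way to reduce the cardinality of a categorical variable is to merge rarely seen levels into a single bucket. The sketch below illustrates that idea only; the helper name group_rare_categories, the threshold, the "OTHER" label, and the toy values are assumptions made for illustration and are not taken from the project.

```python
import pandas as pd


def group_rare_categories(series: pd.Series, min_count: int = 2,
                          other_label: str = "OTHER") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other_label`.

    Missing values are left untouched so they can be imputed separately.
    """
    counts = series.value_counts()          # NaN is excluded from the counts
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)


# Toy data, invented for illustration only.
ethnicity = pd.Series(["A", "A", "B", "C", "A", "B", "D"])
grouped = group_rare_categories(ethnicity, min_count=2)
print(grouped.tolist())
```

In practice the threshold would be chosen on the training split only, so that the grouping does not depend on validation or test rows.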

Time-sensitive information was handled by extracting and normalizing the number of previous ICU visits. In addition, a custom encoder, MultiColumnTargetEncoder, grouped multiple diagnoses into signal-rich variables. The encoder was written to work within the sklearn pipeline framework, so the target encoding was fitted inside each cross-validation split, on the training fold only, which prevented data leakage.
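The project's MultiColumnTargetEncoder is not shown here, so the following is only a minimal sketch of how a pipeline-compatible target encoder over several diagnosis columns could look. The class name SimpleMultiColumnTargetEncoder, the smoothing scheme, the column names, and the toy data are all assumptions. The key point is that fit() sees only the training fold, so statistics from validation rows never enter the encoding.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SimpleMultiColumnTargetEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: smoothed mean-target encoding pooled over columns."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing

    def fit(self, X, y):
        X = pd.DataFrame(X).reset_index(drop=True)
        y = np.asarray(y, dtype=float)
        self.global_mean_ = float(y.mean())
        # Pool all columns so a code gets one encoding wherever it appears.
        stacked = pd.DataFrame({
            "code": X.to_numpy().ravel(order="F"),   # column 0 rows, then column 1 rows, ...
            "y": np.tile(y, X.shape[1]),
        }).dropna(subset=["code"])
        stats = stacked.groupby("code")["y"].agg(["sum", "count"])
        smoothed = (stats["sum"] + self.smoothing * self.global_mean_) / (
            stats["count"] + self.smoothing
        )
        self.mapping_ = smoothed.to_dict()
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        encoded = X.apply(lambda col: col.map(self.mapping_)).astype(float)
        # Average the encodings of a row's codes; rows with only unseen or
        # missing codes fall back to the global training mean.
        row_mean = encoded.mean(axis=1).fillna(self.global_mean_)
        return row_mean.to_numpy().reshape(-1, 1)


# Toy training fold, invented for illustration only.
X_train = pd.DataFrame({"diag_1": ["a", "a", "b", "b"],
                        "diag_2": ["c", None, "c", "a"]})
y_train = [1, 0, 1, 0]

encoder = SimpleMultiColumnTargetEncoder(smoothing=1.0)
encoded = encoder.fit_transform(X_train, y_train)
print(encoded.ravel())
```

Because the class follows the fit/transform contract, it can be placed inside a ColumnTransformer or Pipeline, and cross-validation utilities will refit it on each training fold automatically.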

The pipeline also took advantage of the ColumnTransformer functionality for a clean and modular structure that makes the approach replicable and extensible:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# [...]

# Route numeric and categorical columns through their own sub-pipelines
column_transformer = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

# probability=True enables predict_proba on the fitted SVC
final_pipeline = Pipeline([
    ("preprocessing", column_transformer),
    ("classifier", SVC(probability=True)),
])

This strategic design laid a solid foundation for downstream modeling.

Modeling Strategy and Implementation ⚙️📈

Two classifiers were employed for this task: kNN and SVM. Each was tuned via grid search to find its best-performing hyperparameters. For kNN, the search covered the number of neighbors (k), the weighting strategy, and the distance metric. For the SVM, it covered kernel selection, the regularization parameter, and the margin settings.
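A grid search of this kind could be set up roughly as follows. This is a sketch under stated assumptions: the data is synthetic (make_classification stands in for the ICU records), the grid values are placeholders rather than the ones used in the project, and a plain StandardScaler replaces the project's preprocessing step.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (assumption, not the ICU dataset).
X, y = make_classification(n_samples=200, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

searches = {
    "knn": GridSearchCV(
        Pipeline([("scale", StandardScaler()),
                  ("classifier", KNeighborsClassifier())]),
        param_grid={
            "classifier__n_neighbors": [5, 15],
            "classifier__weights": ["uniform", "distance"],
            "classifier__metric": ["euclidean", "manhattan"],
        },
        scoring="roc_auc",
        cv=cv,
    ),
    "svm": GridSearchCV(
        Pipeline([("scale", StandardScaler()),
                  ("classifier", SVC())]),
        param_grid={
            "classifier__kernel": ["linear", "rbf"],
            "classifier__C": [0.1, 1, 10],
        },
        scoring="roc_auc",
        cv=cv,
    ),
}

for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the Pipeline means it is refitted on each training fold, which mirrors the leakage-avoidance argument made for the target encoder.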

The modeling objective was to predict the risk of death for ICU patients based on their clinical and demographic features. Individual death probabilities were computed from the fitted pipeline:

# Predicting probability of death
probas = final_pipeline.predict_proba(X_test)[:, 1]

ROC-AUC values above 0.9 indicate that the model separates patients who died from those who survived well across decision thresholds.
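For reference, ROC-AUC can be computed from predicted probabilities on a held-out set as sketched below. The data is synthetic and the pipeline is a simplified stand-in (both assumptions), so the printed score says nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the ICU dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", SVC(probability=True, random_state=0)),
])
model.fit(X_train, y_train)

# Column 1 holds the probability of the positive class (here: label 1).
probas = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probas)
print(f"Held-out ROC-AUC: {auc:.3f}")
```

ROC-AUC depends only on how the scores rank patients, not on a particular cutoff, so it is usually reported alongside threshold-dependent metrics such as recall or precision.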

Interpretability and Robustness 🔍🧠

Hyperparameter tuning was carried out through grid searches that were refined iteratively. The pipeline also included methods to detect and manage class imbalance, although future refinements could improve this further (see below). The code was annotated with clear comments that show a solid understanding of each transformation and modeling step.
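The text does not say which imbalance technique was used. One widely used option in sklearn is class weighting, sketched here with invented labels; treat it as an illustration of the general idea rather than a description of the project's method.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Invented labels: 90 survivors (0) and 10 deaths (1).
y = np.array([0] * 90 + [1] * 10)

# Detect imbalance: share of the positive class.
prevalence = y.mean()

# "balanced" weights are n_samples / (n_classes * class_count).
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.round(3).tolist())))

# The same weighting can be requested directly on the classifier.
weighted_svc = SVC(probability=True, class_weight="balanced")
```

With these labels the minority class receives a weight of 100 / (2 * 10) = 5.0, so misclassifying a death costs the SVM roughly nine times as much as misclassifying a survivor.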

The final results were consistent with theoretical expectations. Both models performed well, with SVM slightly outperforming kNN in overall accuracy and probability estimation.

Areas for Improvement and Opportunities ✨🧭

While the project already demonstrates a high level of technical execution and conceptual depth, a few areas offer room for refinement. Each points toward a potential enhancement rather than a shortcoming:

  • Advanced Feature Engineering: Religion could have been encoded in a more nuanced fashion, as was done with ethnicity. Nonetheless, its binary simplification kept the dimensionality low, which is a reasonable modeling choice given the complexity of the dataset.
  • Model Generalization: Including more discussion or testing across different patient subgroups (e.g., by age, sex, or ICU visit count) could enrich the interpretability of model outcomes and help in real-world application.
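The subgroup analysis suggested in the last point could start from a per-group ROC-AUC table. The helper below is a hypothetical sketch: the function name auc_by_subgroup, the group labels, and the scores are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score


def auc_by_subgroup(y_true, y_score, groups) -> pd.Series:
    """Return ROC-AUC per subgroup; NaN where a subgroup has only one class."""
    frame = pd.DataFrame({"y": y_true, "score": y_score, "group": groups})
    results = {}
    for name, part in frame.groupby("group"):
        if part["y"].nunique() == 2:
            results[name] = roc_auc_score(part["y"], part["score"])
        else:
            # ROC-AUC is undefined without both outcomes present.
            results[name] = float("nan")
    return pd.Series(results, name="roc_auc")


# Invented labels, scores, and group membership.
y_true = [0, 1, 0, 1, 0, 1]
y_score = [0.1, 0.9, 0.4, 0.3, 0.2, 0.8]
sex = ["F", "F", "M", "M", "M", "M"]

per_group = auc_by_subgroup(y_true, y_score, sex)
print(per_group)
```

Small subgroups give noisy estimates, so in a real analysis each value would ideally be reported with the subgroup size or a confidence interval.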

Final Reflection and Outlook 🌟🩺📉

This project is a strong example of applied machine learning in healthcare. From preprocessing to evaluation, each step was thoughtfully executed, grounded in good data practices and robust modeling techniques. The use of both SVM and kNN enabled a multi-perspective view of the problem space, with valuable insights drawn from performance comparisons.

Future iterations might explore ensemble methods, cost-sensitive learning, or temporal modeling of patient trajectories.

In sum, this project is a compelling demonstration of machine learning’s potential to contribute to critical healthcare decisions. 💡📊🚑