In healthcare, timely and precise diagnosis is essential for effective patient treatment. Sepsis is a life-threatening condition that can result from the body’s response to an infection, and early detection is crucial to prevent its progression. To help medical professionals with this task, a web application has been developed using FastAPI. The application uses a trained machine learning model to predict the likelihood of sepsis from relevant patient data, allowing healthcare providers to identify and manage sepsis cases quickly and efficiently.
Healthcare professionals are always seeking new ways to improve patient outcomes, and predictive tools can be immensely helpful. Traditional diagnostic methods can be time-consuming and prone to human error, and the integration of machine learning into medical practice has opened up new possibilities for faster, more precise diagnosis. This article walks through a machine learning-powered application that predicts sepsis in patients, using FastAPI for API development, Docker for containerized deployment, and the Hugging Face platform for hosting.
Before delving into the application details, it is necessary to set up a suitable environment. The first step is installing key libraries such as scikit-learn and imbalanced-learn, which are needed to build and apply the machine learning models. The article also highlights the use of joblib/pickle for object serialization, so that trained models remain consistent and reusable.
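As a quick sketch, the setup amounts to a handful of pip installs plus joblib for persistence; the package list and file name below are illustrative, not taken verbatim from the project:

```python
# Run once in a terminal or notebook cell:
#   pip install scikit-learn imbalanced-learn pandas numpy seaborn matplotlib joblib xgboost fastapi uvicorn

import joblib
from sklearn.linear_model import LogisticRegression

# Illustrative only: estimators round-trip through joblib unchanged.
# In the project it is the fitted pipeline that gets saved for the API.
model = LogisticRegression()
joblib.dump(model, "model.joblib")   # save to disk
model = joblib.load("model.joblib")  # reload with identical state
```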
Data is essential for any machine learning project. This section explains how the dataset is loaded and explored, covering a detailed analysis with libraries such as pandas, numpy, seaborn, and matplotlib. Visualization and summary statistics are used to understand the structure, distribution, and properties of the data, and particular attention is paid to handling missing values and duplicate entries to ensure data quality and reliability.
To begin, the article demonstrates how to load the dataset in Google Colab through the drive mount mechanism. It provides a glimpse of the dataset by displaying the initial rows to familiarize you with its organization and contents. The dataset encompasses various factors such as plasma glucose, blood pressure, and body mass index that can predict the likelihood of sepsis, while the categorical target column “Sepssis” denotes whether a patient has contracted sepsis, as shown in the following table.
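A minimal loading snippet might look like the following; the Drive path and file name are assumptions, so adjust them to wherever the CSV actually lives:

```python
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

# Path is an assumption; point it at your copy of the dataset.
df = pd.read_csv('/content/drive/MyDrive/sepsis/train.csv')
print(df.shape)  # expect (599, 11)
df.head()
```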
|     | ID        | PRG | PL  | PR | SK | TS  | M11  | BD2   | Age | Insurance | Sepssis  |
|-----|-----------|-----|-----|----|----|-----|------|-------|-----|-----------|----------|
| 0   | ICU200010 | 6   | 148 | 72 | 35 | 0   | 33.6 | 0.627 | 50  | 0         | Positive |
| 1   | ICU200011 | 1   | 85  | 66 | 29 | 0   | 26.6 | 0.351 | 31  | 0         | Negative |
| 2   | ICU200012 | 8   | 183 | 64 | 0  | 0   | 23.3 | 0.672 | 32  | 1         | Positive |
| 3   | ICU200013 | 1   | 89  | 66 | 23 | 94  | 28.1 | 0.167 | 21  | 1         | Negative |
| 4   | ICU200014 | 0   | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33  | 1         | Positive |
| …   | …         | …   | …   | …  | …  | …   | …    | …     | …   | …         | …        |
| 594 | ICU200604 | 6   | 123 | 72 | 45 | 230 | 33.6 | 0.733 | 34  | 0         | Negative |
| 595 | ICU200605 | 0   | 188 | 82 | 14 | 185 | 32.0 | 0.682 | 22  | 1         | Positive |
| 596 | ICU200606 | 0   | 67  | 76 | 0  | 0   | 45.3 | 0.194 | 46  | 1         | Negative |
| 597 | ICU200607 | 1   | 89  | 24 | 19 | 25  | 27.8 | 0.559 | 21  | 0         | Negative |
| 598 | ICU200608 | 1   | 173 | 74 | 0  | 0   | 36.8 | 0.088 | 38  | 1         | Positive |

599 rows × 11 columns
The dataset’s metadata has been thoroughly investigated, with a detailed explanation of each column’s meaning and significance provided. The article also confirms that all columns contain complete data, ensuring a clean dataset for training machine learning models effectively.
Exploratory Data Analysis (EDA) is an essential phase for understanding the characteristics and relationships within a dataset. Using histograms, summary statistics, and visualizations, this section offers insights into the distribution and behavior of the numerical variables. Key observations include the right-skewed distribution of Plasma glucose, the relatively normal distribution of Blood Work Result-1, and the highly right-skewed Blood Work Result-3 distribution caused by extreme values. These observations are illustrated in Figure 1.
Figure 1
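The histograms in Figure 1 can be reproduced with a short pandas/matplotlib snippet along these lines, where `df` is the loaded DataFrame and the column names follow the table above:

```python
import matplotlib.pyplot as plt

# Numeric feature columns (the ID and the target column are excluded).
num_cols = ['PRG', 'PL', 'PR', 'SK', 'TS', 'M11', 'BD2', 'Age']

print(df[num_cols].describe())                # summary statistics
df[num_cols].hist(bins=30, figsize=(12, 8))   # one histogram per feature
plt.tight_layout()
plt.show()
```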
To understand the relationships and patterns in the dataset, univariate and bivariate analysis is needed. Box plots show how the distribution of each numerical variable differs between the “Sepssis” categories, illustrating how the numerical characteristics vary between patients with and without sepsis. A categorical analysis also shows how the occurrence of sepsis varies with insurance status. These findings are presented in Figures 2, 3, 4, and 5, and a sketch of how such plots are produced follows below.
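A sketch of how such plots are typically produced with seaborn, reusing `df` and `num_cols` from the earlier snippets:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Bivariate view: each numeric feature split by the target class.
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.ravel(), num_cols):
    sns.boxplot(data=df, x='Sepssis', y=col, ax=ax)
plt.tight_layout()
plt.show()

# Categorical view: sepsis occurrence by insurance status.
sns.countplot(data=df, x='Insurance', hue='Sepssis')
plt.show()
```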
9. Correlation Heatmap
Understanding the connections between numerical features is crucial when selecting features and building models. This section presents a correlation heatmap that displays the magnitude and direction of the correlations between numerical variables, which helps in identifying possible multicollinearity and in selecting features for the machine learning models. The heatmap is shown in Figure 6.
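A heatmap like the one in Figure 6 can be generated with a few lines (a sketch, again reusing `df` and `num_cols`):

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr = df[num_cols].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation between numeric features')
plt.show()
```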
10. Model Selection and Evaluation
Once the data has been processed and relevant features have been engineered, the next step is to choose and assess the machine learning models. In this section, we will train and evaluate nine classification models to determine which ones are the best performers for our task. The models we will be examining are:
1. Logistic Regression
2. K-Nearest Neighbors
3. Decision Tree
4. Support Vector Machine (Linear Kernel)
5. Support Vector Machine (RBF Kernel)
6. Neural Network
7. Random Forest
8. Gradient Boosting
9. XGBoost
To ensure a fair comparison, each model will be trained using the same pipeline, consisting of a standard scaler for feature scaling and the specific classification algorithm. We will evaluate the models based on precision, recall, F1-score, and accuracy to understand their performance across different evaluation metrics.
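In outline, the comparison loop might look like this. It is a sketch: `X_train`, `y_train`, `X_val`, and `y_val` are assumed to come from an earlier train/validation split, the target is assumed label-encoded as 0/1, and the hyperparameters shown are library defaults rather than the article’s exact settings:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM (Linear)': SVC(kernel='linear'),
    'SVM (RBF)': SVC(kernel='rbf'),
    'Neural Network': MLPClassifier(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss'),
}

# Same pipeline for every model: scale features, then classify.
for name, clf in models.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('clf', clf)])
    pipe.fit(X_train, y_train)
    print(name)
    print(classification_report(y_val, pipe.predict(X_val)))
```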
Once promising models have been identified, the next step is to fine-tune their hyperparameters to achieve optimal performance. In this section, we will concentrate on the models that demonstrated the highest performance:
1. Gradient Boosting
2. K-Nearest Neighbors
3. Support Vector Machine (RBF Kernel)
4. Logistic Regression
To enhance our models’ performance on validation data, we will employ GridSearchCV, which exhaustively searches through a predetermined hyperparameter grid to identify the optimal combination of hyperparameters. The horizontal bar graphs in Figure 7 illustrate the F1-score performance of the four models.
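A representative GridSearchCV setup for the Gradient Boosting pipeline might look like this; the grid values are illustrative, not the article’s exact search space:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', GradientBoostingClassifier(random_state=42))])

# Illustrative grid; the 'clf__' prefix routes each parameter
# to the classifier step inside the pipeline.
param_grid = {
    'clf__n_estimators': [100, 200, 300],
    'clf__learning_rate': [0.01, 0.1, 0.2],
    'clf__max_depth': [2, 3, 4],
}

search = GridSearchCV(pipe, param_grid, scoring='f1', cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
best_model = search.best_estimator_
```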
In this section, we’ll assess the performance of hyperparameter-tuned models and choose the best one for deployment. We’ll evaluate metrics like F1-score, accuracy, and other relevant factors to make an informed decision. The final model will be the one that shows the highest overall performance on the validation data.
It’s important to know which features are most significant in a predictive model to gain insights into what drives predictions. In this section, we’ll examine the feature importance of the final model (Gradient Boosting). By interpreting and visualizing feature importance, we can pinpoint the most influential features that contribute to the model’s predictions, shown in Figure 8.
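With the tuned pipeline from the previous snippet, the importances can be pulled out and plotted roughly like this (assuming `X_train` is a DataFrame, so column names are available):

```python
import pandas as pd
import matplotlib.pyplot as plt

gb = best_model.named_steps['clf']  # the fitted GradientBoostingClassifier
importances = pd.Series(gb.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind='barh', figsize=(8, 5))
plt.title('Gradient Boosting feature importance')
plt.tight_layout()
plt.show()
```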
Evaluating binary classification model performance requires the use of the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) score. This section will include a visualization of the ROC curve for the chosen model and an AUC score calculation. The ROC curve helps determine the balance between true positive and false positive rates at different classification thresholds, while the AUC score measures the model’s ability to differentiate between positive and negative classes.
The AUC (Area Under the Curve) score of 0.785 reflects how well the Gradient Boosting model distinguishes between positive and negative classes. The score ranges from 0 to 1, with 0.5 indicating a random classifier and 1.0 indicating a perfect classifier. Refer to Figure 9.
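A sketch of how the curve and score are computed, continuing with the 0/1-encoded validation labels from before (1 = sepsis positive):

```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

y_score = best_model.predict_proba(X_val)[:, 1]  # P(positive class)
fpr, tpr, _ = roc_curve(y_val, y_score)
auc = roc_auc_score(y_val, y_score)

plt.plot(fpr, tpr, label=f'Gradient Boosting (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random classifier')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```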
The confusion matrix is a useful tool for evaluating the performance of a model, especially with regard to classification errors. In this section, we will analyze the confusion matrix of the chosen model to gain insight into its true positives, true negatives, false positives, and false negatives. By studying these numbers, we can understand the errors the model makes and form an informed view of its performance. Figure 10 displays the confusion matrix, which can be summarized as follows:
The confusion matrix for the Gradient Boosting model indicates the following:
True Positive (TP): There are 26 instances that are correctly predicted as positive (actual positive and predicted positive).
True Negative (TN): There are 62 instances that are correctly predicted as negative (actual negative and predicted negative).
False Positive (FP): There are 16 instances that are incorrectly predicted as positive (actual negative but predicted positive).
False Negative (FN): There are 16 instances that are incorrectly predicted as negative (actual positive but predicted negative).
These results show that the Gradient Boosting model correctly predicted 26 positive cases and 62 negative cases, while misclassifying 16 instances as false positives and 16 as false negatives. The model’s performance is not perfect, but it captures both positive and negative cases to a reasonable extent.
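The matrix itself is a single scikit-learn call; a sketch:

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_val, best_model.predict(X_val))
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot()
plt.show()
```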
After constructing and fine-tuning our model, we will apply it to new, unseen data to generate predictions. First, we will load the test dataset and preprocess it similarly to the training and validation data. Then, we will apply the chosen model to estimate the likelihood of sepsis for each patient. In our article, we will describe how the model performed on this previously unseen data and the potential consequences for clinical decision-making. This thorough process of developing and testing a machine-learning model for sepsis prediction will be useful for both data scientists and healthcare professionals interested in using machine learning for medical diagnosis.
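In code, scoring the held-out test file reduces to something like the following sketch; the file name is an assumption, and preprocessing must mirror what was done for training:

```python
import pandas as pd

test_df = pd.read_csv('test.csv')    # file name is illustrative
X_test = test_df[X_train.columns]    # same feature columns, same order
test_df['Predicted_Sepssis'] = best_model.predict(X_test)
test_df['Positive_Probability'] = best_model.predict_proba(X_test)[:, 1]
```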
After training and evaluating the machine learning model, the next step is to develop an API that allows us to make predictions with it. To accomplish this, we will use FastAPI, a modern and efficient web framework for creating APIs in Python. Through the FastAPI app, we can submit patient data and obtain predictions for the likelihood of sepsis from the trained model.
17.1 Setting Up the FastAPI App
In this section, we will guide you through the process of setting up the FastAPI application. We will define a Pydantic model that represents the input data for prediction, load the trained machine-learning components, and create API routes to handle prediction requests. You can find the FastAPI code in my repository link provided at the end of this article.
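The actual implementation lives in the repository, but a minimal sketch of its shape might look like this; the field names mirror the dataset columns, while the model file name is an assumption:

```python
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI(title='Sepsis Classification API')

# Input schema mirroring the dataset's feature columns.
class PatientData(BaseModel):
    PRG: int
    PL: float
    PR: float
    SK: float
    TS: float
    M11: float
    BD2: float
    Age: int
    Insurance: int

# Load the persisted pipeline once at startup (file name is illustrative).
model = joblib.load('gb_pipeline.joblib')

@app.get('/')
def root():
    return {'message': 'Sepsis prediction API is running.'}
```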
17.2 Handling Predictions and Providing Results
Our FastAPI application includes a “/classify” endpoint that takes patient data as input and generates predictions of sepsis likelihood. The endpoint validates the input, runs it through the trained machine learning model, and returns the predicted class along with confidence scores for each prediction category. Please see Figures 11 to 15 for a visual representation:
Output of the root endpoint:
Sepsis_Classification_endpoint_Pre_execution:
Sepsis_Classification_endpoint_Post_execution:
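Continuing the sketch above, the prediction route could be written along these lines (`.dict()` assumes Pydantic v1; use `.model_dump()` on v2):

```python
import pandas as pd

@app.post('/classify')
def classify(patient: PatientData):
    # Build a one-row frame in the column order the pipeline expects.
    row = pd.DataFrame([patient.dict()])
    proba = model.predict_proba(row)[0]
    return {
        'prediction': 'Positive' if proba[1] >= 0.5 else 'Negative',
        'confidence': {'Negative': float(proba[0]),
                       'Positive': float(proba[1])},
    }
```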
17.3 Running the FastAPI App
FastAPI apps are served by the Uvicorn ASGI server. The app binds to a specified port (for example, 8000) and is then accessible through HTTP requests. To run the FastAPI app, use the following command:

uvicorn main:app --host 0.0.0.0 --port 8000 --reload
Packaging applications and their dependencies in a consistent manner is made possible by containerization. This ensures that they can run reliably in different environments. Docker, a widely used containerization platform, enables us to manage and create containers for our application. To simplify deployment and ensure smooth functioning in different settings, we will use Docker to containerize the FastAPI app.
18.1 Dockerfile
We define the base image, working directory, dependencies, and application code in the Dockerfile. The “CMD” instruction sets the command to be executed when the container starts, launching the FastAPI app with “uvicorn”. Figure 16 shows the Dockerfile and the container in action on Hugging Face.
FROM python:3.9
WORKDIR /app
COPY ./requirements.txt /requirements.txt
RUN pip install --no-cache-dir --upgrade -r /requirements.txt
COPY . .
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "7860"]
The Docker container in action on Hugging Face:
Deploying the FastAPI app is made simple through containerization with Docker. With a single command, the entire application stack, including the FastAPI app, the machine learning model, and all dependencies, can be deployed on a variety of platforms: local servers, cloud platforms, or Kubernetes clusters.
19.1 Running the Docker Container
To build the image and run the container that serves the FastAPI app, commands like the following are typical (the image name sepsis-api is illustrative):
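docker build -t sepsis-api .
docker run -p 8000:7860 sepsis-api

The -p 8000:7860 mapping forwards local port 8000 to container port 7860, the port set in the Dockerfile’s CMD.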
The application will be accessible at http://localhost:8000, and you can make prediction requests using the /classify endpoint.
It is also accessible on Hugging Face at https://junr-syl-api-sepsis-classifier.hf.space/docs#/
20. Takeaways
In this article, we have covered the entire process of creating a machine learning model to predict sepsis and deploying it as a FastAPI application with Docker. We have walked through each stage in detail, including data preprocessing, model selection, building a robust API, and containerizing the application. With this approach, we can efficiently construct predictive models and ensure they are seamlessly deployed for real-world use. Tools such as FastAPI and Docker help bridge the gap between model creation and practical deployment, contributing to the growth of healthcare and other disciplines as data science and machine learning continue to evolve.
21. Improvements
Although the sepsis prediction model has been developed and put into use effectively, there are still several possibilities for further improvement and exploration. Potential future directions include improving the model’s interpretability, integrating real-time data streaming to enable dynamic predictions, and linking the application with electronic health record systems for smoother integration into clinical workflows. By continually refining and extending this work with advanced machine learning methods, we can contribute to the ongoing effort to improve patient outcomes.
I am grateful for your time in reading this article. Moving from data pre-processing to model deployment has helped us connect machine learning research with practical implementation, leading to innovative solutions in healthcare and beyond. If you have any suggestions or feedback, please feel free to contact me through the email provided.
I highly recommend Azubi Africa for their comprehensive and effective programs. Read more articles about Azubi Africa here, and take a few minutes to visit this link to learn more about Azubi Africa’s life-changing programs.