Stroke Predictions using Machine Learning | by Pierre BA

Ola! I have a friend who always says hello in a different language when I call her. Back to the main topic: I have been exploring machine learning since last year, and my, it's been a ride. Along the way I've come to realize that nothing is straightforward. Before I start an ML project, I open about 10 tabs, because there is a lot of reading and information gathering to do before you're done (maybe that's just me, hahaha).

In this post, the aim is to predict stroke based on the features in the dataset.

Data

The data used is from the Kaggle Stroke Prediction Dataset. The full notebook can be found here as well.

Steps summary

Import libraries

Import data

Clean data

EDA

Correlation Analysis

Label encoding

Split Data

Testing and making Predictions

EDA for Stroke Predictions

Import libraries needed for prediction and read files to understand the dataset

Our objective for the EDA is to extract insights from the dataset and clean the data

Libraries

Python has some great libraries for making predictions. scikit-learn is the library used here, with its Naive Bayes classifier as the model.

Other classifiers, such as KNN or a random forest classifier, could be used as well.

A sample of the data is shown below, using pandas to load the dataset.
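Here is a minimal sketch of the loading step. The rows are made up and stand in for the Kaggle CSV, and the column names are my assumption about how the file is laid out, so treat it as an illustration rather than the notebook's exact code.

```python
import io

import pandas as pd

# Made-up rows standing in for the Kaggle CSV; the column names are assumptions.
csv_text = """id,gender,age,ever_married,work_type,bmi,stroke
1,Male,67.0,Yes,Private,36.6,1
2,Female,61.0,Yes,Self-employed,,1
3,Female,49.0,Yes,Private,34.4,0
4,Male,3.0,No,children,18.0,0
"""

# With the real download, pass the path of the CSV file to pd.read_csv instead.
df = pd.read_csv(io.StringIO(csv_text))

# Show the first rows as a quick sample of the data.
print(df.head())
```

The empty bmi field in the second row is read as a missing value, which is the kind of thing the cleaning step looks for.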

Check the dataset for null values and duplicates using df.isnull() and df.duplicated() to clean the dataset. The ID column was dropped, as it had no relevance to the prediction.
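The cleaning checks can be sketched like this. The small DataFrame is made up, and the column names (id, age, bmi, stroke) are assumptions for illustration only.

```python
import pandas as pd

# Made-up data: the last row repeats the third one, and one bmi value is missing.
df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "age": [67.0, 61.0, 49.0, 49.0],
    "bmi": [36.6, None, 34.4, 34.4],
    "stroke": [1, 1, 0, 0],
})

# Count missing values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Count rows that are exact copies of an earlier row.
duplicate_count = df.duplicated().sum()
print(duplicate_count)

# Remove the duplicate rows, then drop the id column.
clean = df.drop_duplicates().drop(columns=["id"])
print(clean)
```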

Pandas is used to find information on the dataset and the data types of the respective columns. The age column was a float and had to be converted to an integer, as it would be one of the features for the model.
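A small sketch of inspecting the data types and converting age, using made-up values and assumed column names:

```python
import pandas as pd

# Made-up data; the column names are assumptions.
df = pd.DataFrame({
    "age": [67.0, 61.0, 49.0, 3.0],
    "work_type": ["Private", "Self-employed", "Private", "children"],
    "stroke": [1, 1, 0, 0],
})

# Summary of columns, non-null counts and data types.
df.info()

# Convert age from float to integer (astype(int) drops any fractional part).
df["age"] = df["age"].astype(int)
print(df.dtypes)
```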

The value_counts function is applied to get a distinct count for the following columns: work type, ever married, and stroke.
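For illustration, value_counts on those three columns could look like the sketch below. The data is made up, and the column names and category labels are assumptions.

```python
import pandas as pd

# Made-up data; the column names and category labels are assumptions.
df = pd.DataFrame({
    "work_type": ["Private", "Private", "Govt_job", "children", "Private"],
    "ever_married": ["Yes", "Yes", "Yes", "No", "No"],
    "stroke": [1, 0, 0, 0, 0],
})

# Print how often each distinct value appears in each column.
for column in ["work_type", "ever_married", "stroke"]:
    print(df[column].value_counts())

work_counts = df["work_type"].value_counts()
stroke_counts = df["stroke"].value_counts()
```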

Visualize data

Seaborn and Plotly are used to visualize marital status and age in relation to stroke.

Feature Engineering

A correlation matrix is used to find which features (columns) are suitable for the model. It turns out BMI is barely correlated with stroke compared to the other features.
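A correlation matrix can be computed with pandas as in the sketch below. The numbers are made up, so the values it prints say nothing about the real dataset, and the column names are assumptions.

```python
import pandas as pd

# Made-up data; the column names are assumptions.
df = pd.DataFrame({
    "age": [67, 61, 49, 3, 80, 25],
    "bmi": [36.6, 28.1, 34.4, 18.0, 27.5, 22.0],
    "avg_glucose_level": [228.7, 202.2, 171.2, 95.1, 105.9, 88.0],
    "stroke": [1, 1, 0, 0, 1, 0],
})

# Pairwise correlation between the numeric columns.
corr = df.select_dtypes(include="number").corr()
print(corr)

# Correlation of each feature with the target, strongest first.
stroke_corr = corr["stroke"].drop("stroke").sort_values(ascending=False)
print(stroke_corr)
```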

Afterwards, I did more analysis with the seaborn library to find out how type of work relates to the likelihood of having a stroke.

The catplot indicates that people who work private jobs have the highest probability of not getting a stroke, people working government jobs have little probability of getting one, and children and people who have never worked have no probability of getting a stroke.
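One way to put numbers next to a plot like this is to tabulate stroke counts per work type and the share of strokes within each group. This is only a sketch on made-up rows, with assumed column names and category labels, so the figures are not results from the real data.

```python
import pandas as pd

# Made-up data; the column names and category labels are assumptions.
df = pd.DataFrame({
    "work_type": ["Private", "Private", "Private", "Private",
                  "Govt_job", "Govt_job", "children", "Never_worked"],
    "stroke": [1, 0, 0, 0, 1, 0, 0, 0],
})

# Counts of stroke (1) and no stroke (0) for each work type.
counts = pd.crosstab(df["work_type"], df["stroke"])
print(counts)

# Share of people with a stroke within each work type.
stroke_rate = df.groupby("work_type")["stroke"].mean()
print(stroke_rate)
```

Looking at the share per group helps when the groups differ a lot in size, since the biggest group tends to have the tallest bars either way.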

Label encoding

Label encoding prepares the data for training. It converts non-numerical (categorical) values into numerical ones, so that the model can train on them and make predictions.
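A minimal sketch of label encoding with scikit-learn's LabelEncoder, on made-up data with assumed column names. It encodes only the text columns and keeps each encoder so the original labels can be looked up later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data; the column names and category labels are assumptions.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "work_type": ["Private", "Govt_job", "children", "Private"],
    "age": [67, 61, 3, 49],
    "stroke": [1, 1, 0, 0],
})

categorical_columns = ["gender", "work_type"]
encoders = {}

for column in categorical_columns:
    encoder = LabelEncoder()
    # Replace each distinct label with an integer code.
    df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder

print(df)
print(list(encoders["work_type"].classes_))
```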

Defining features for training

The target variable is stroke, and all other features (age, work type, gender, etc.) will be used for prediction.

As a result of label encoding, every column is converted to numerical values.
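Separating the features from the target could look like this. The data is made up and already encoded, and the names X and y are just the usual convention, not something taken from the notebook.

```python
import pandas as pd

# Made-up, already-encoded data; the column names are assumptions.
df = pd.DataFrame({
    "gender": [1, 0, 0, 1],
    "age": [67, 61, 3, 49],
    "work_type": [1, 0, 2, 1],
    "stroke": [1, 1, 0, 0],
})

# Everything except the target column is a feature.
X = df.drop(columns=["stroke"])

# The target is the stroke column.
y = df["stroke"]

print(X.columns.tolist())
```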

Split Data

The data is split using train_test_split from scikit-learn's model_selection module, with the test size set to 0.6 and the random state set to 1.
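A sketch of the split with the same settings, on ten made-up rows with assumed column names:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data; the column names are assumptions.
df = pd.DataFrame({
    "age": [67, 61, 3, 49, 80, 25, 54, 72, 18, 40],
    "work_type": [1, 0, 2, 1, 1, 0, 1, 1, 2, 0],
    "stroke": [1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})
X = df.drop(columns=["stroke"])
y = df["stroke"]

# test_size=0.6 keeps 60% of the rows for testing; random_state=1 makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=1
)

print(len(X_train), len(X_test))
```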

Testing and making Predictions

After the data is split, I run the Naive Bayes model against it. Using the fit() function I train the model, and then I make some predictions using the test data.

The model scored 88%, which is quite accurate. Model tuning could improve it, but I'm yet to explore that 😁
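The training, prediction and scoring steps can be sketched as below. I am assuming the Gaussian variant of Naive Bayes (GaussianNB), since the post does not say which one was used, and the data is randomly generated, so the accuracy it prints is not the 88% from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Randomly generated stand-in data; the column names are assumptions.
rng = np.random.default_rng(0)
age = rng.integers(1, 90, size=100)
df = pd.DataFrame({
    "age": age,
    "avg_glucose_level": rng.normal(100, 20, size=100),
    "stroke": (age > 60).astype(int),  # toy rule, only for this sketch
})

X = df.drop(columns=["stroke"])
y = df["stroke"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=1
)

# Train the model on the training rows.
model = GaussianNB()
model.fit(X_train, y_train)

# Predict on the held-out test rows.
predictions = model.predict(X_test)

# Share of test rows predicted correctly.
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
```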

I save the model's results in an Excel file for further analysis. In real life, model results can be saved in your database as well.
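As a sketch of the database option, the snippet below writes a small table of results to an in-memory SQLite database with pandas. The results table, its column names and the table name stroke_predictions are all hypothetical. The Excel route is shown as a comment, because DataFrame.to_excel needs an Excel writer package such as openpyxl installed.

```python
import sqlite3

import pandas as pd

# Made-up results; the column names are hypothetical.
results = pd.DataFrame({
    "age": [67, 49, 3],
    "actual": [1, 0, 0],
    "predicted": [1, 0, 0],
})

# Excel alternative (hypothetical file name, needs e.g. openpyxl):
# results.to_excel("stroke_predictions.xlsx", index=False)

# Save the results to a SQLite table and read them back.
connection = sqlite3.connect(":memory:")
results.to_sql("stroke_predictions", connection, index=False, if_exists="replace")
saved = pd.read_sql_query("SELECT * FROM stroke_predictions", connection)
connection.close()

print(saved)
```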