![](https://crypto4nerd.com/wp-content/uploads/2023/02/1pH_XiRvuQ9IzZyLKQO1OTA.png)
Ola! I have a friend who always says hello in a different language when I call her. Back to the main topic: I have been exploring machine learning since last year, and my, it's been a ride! Along the way I've come to realize that nothing is straightforward. Before I start an ML project, I open about 10 tabs, because you need to do a lot of reading and information gathering before you're done (maybe that's just me, hahaha).
In this post, the aim is to predict stroke based on the features in the dataset.
Data
The data used is from the Kaggle Stroke Prediction Dataset. The full notebook can be found here as well.
Steps summary
Import libraries
Import data
Clean data
EDA
Correlation Analysis
Label encoding
Split Data
Testing and making Predictions
EDA for Stroke Predictions
Import the libraries needed for prediction and read the files to understand the dataset.
Our objective for the EDA is to extract insights from the dataset and clean the data.
Libraries
Python has some great libraries for making predictions; scikit-learn is the library used here, with its naive Bayes classifier as the model.
Other classifiers, such as KNN and random forest, are among the other classification options.
A sample of the data is shown below, using pandas to load the dataset.
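Here's a minimal sketch of the loading step. The inline rows and the column names are made up for illustration; with the real data you would pass the path of the downloaded Kaggle CSV to `pd.read_csv` instead of the in-memory sample.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for the Kaggle CSV (column names are assumed).
sample_csv = io.StringIO(
    "id,gender,age,ever_married,work_type,bmi,stroke\n"
    "1,Male,67.0,Yes,Private,36.6,1\n"
    "2,Female,61.0,Yes,Self-employed,,1\n"
    "3,Female,49.0,Yes,Private,34.4,0\n"
)

# pd.read_csv accepts a file path or a file-like object.
df = pd.read_csv(sample_csv)

# Peek at the first rows, as you would right after loading.
print(df.head())
```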
Check the dataset for null values and duplicates using df.isnull() and df.duplicated(), then clean it. The ID column was dropped as it has no relevance to the prediction.
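The checks above could look roughly like this. The rows are invented and the column names (`id`, `bmi`, and so on) are assumptions, so treat it as a sketch rather than the exact notebook code.

```python
import pandas as pd

# Made-up rows; the column names are assumptions for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [67.0, 61.0, 49.0, 80.0],
    "bmi": [36.6, None, 34.4, 32.5],
    "stroke": [1, 1, 0, 0],
})

# Count missing values per column.
null_counts = df.isnull().sum()

# duplicated() returns a boolean Series; summing it counts repeated rows.
duplicate_count = df.duplicated().sum()

# Drop the identifier column, since it is not a feature.
df = df.drop(columns=["id"])

print(null_counts)
print("duplicates:", duplicate_count)
```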
Pandas is used to find information on the dataset and the data types of the respective columns. The age column was a float and was converted to an integer, as it would be one of the features for the model.
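A small sketch of that inspection and conversion, using a made-up frame (the `age` and `gender` column names are assumptions). Note that `astype(int)` truncates any fractional part rather than rounding.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "age": [67.0, 61.0, 49.0],
    "gender": ["Male", "Female", "Female"],
})

# info() prints column names, non-null counts, and dtypes.
df.info()

# Cast the float age column to integers (fractions are truncated).
df["age"] = df["age"].astype(int)

print(df.dtypes)
```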
Apply the value_counts function to get a distinct count on the following columns: work type, ever married, and stroke.
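For example, with a few invented rows (the column names `work_type`, `ever_married`, and `stroke` are assumed), the counts can be pulled like this:

```python
import pandas as pd

# Made-up rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "work_type": ["Private", "Private", "Govt_job", "children"],
    "ever_married": ["Yes", "Yes", "No", "No"],
    "stroke": [1, 0, 0, 0],
})

# value_counts() returns how often each distinct value appears.
for column in ["work_type", "ever_married", "stroke"]:
    print(df[column].value_counts())

work_counts = df["work_type"].value_counts()
stroke_counts = df["stroke"].value_counts()
```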
Visualize data
Seaborn and Plotly are used to visualize marital status and age in relation to stroke.
Feature Engineering
A correlation matrix helps find which features/columns are suitable for the model. It turns out BMI is barely correlated with stroke compared to the other features.
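A rough sketch of the correlation step with pandas alone. The numbers are invented, so the values it prints say nothing about the real dataset, and the column names are assumptions.

```python
import pandas as pd

# Made-up numeric rows; real values come from the Kaggle data.
df = pd.DataFrame({
    "age": [67, 61, 49, 80, 25, 33],
    "bmi": [36.6, 28.0, 34.4, 32.5, 22.1, 27.3],
    "stroke": [1, 1, 0, 1, 0, 0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

# Correlation of each feature with the target, strongest first.
stroke_corr = corr["stroke"].drop("stroke").sort_values(ascending=False)

print(stroke_corr)
```

The `corr` matrix can be passed to seaborn's `heatmap` if you prefer to read it as a plot.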
After that, I did more analysis with the seaborn library to find out how type of work relates to the likelihood of having a stroke.
The catplot indicates that people who work private jobs have the highest probability of not having a stroke, compared to people working government jobs, who have little, while children and people who have never worked have no probability of having a stroke.
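To put numbers next to a plot like this, one option is a plain pandas cross-check. This is not the catplot itself, just a sketch of how stroke counts and rates per work type could be tabulated; the rows are invented and the category labels are assumptions.

```python
import pandas as pd

# Made-up rows; work_type labels are assumed for illustration.
df = pd.DataFrame({
    "work_type": ["Private", "Private", "Private", "Private",
                  "Govt_job", "Govt_job", "children", "children"],
    "stroke": [1, 0, 0, 0, 0, 1, 0, 0],
})

# Raw counts of stroke (1) and no stroke (0) per work type.
counts = pd.crosstab(df["work_type"], df["stroke"])

# Share of stroke cases within each work type.
stroke_rate = df.groupby("work_type")["stroke"].mean()

print(counts)
print(stroke_rate)
```

Looking at rates alongside counts helps when one group is much larger than the others, since a big group will show tall bars for both outcomes.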
Label encoding
Label encoding is done to prepare the data for training. It converts non-numerical (categorical) data to numerical values, which the model needs in order to make predictions.
Defining features for training
The target variable is stroke, and all other features (age, work type, gender, etc.) will be used for prediction.
As a result of label encoding, every column is converted to numerical values.
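Here's one way the encoding and the feature/target split could be written with scikit-learn's `LabelEncoder`. The rows, the column names, and the list of categorical columns are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; column names are assumptions.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67, 61, 49, 80],
    "work_type": ["Private", "Govt_job", "Private", "children"],
    "stroke": [1, 1, 0, 0],
})

# Columns holding text categories (assumed list for this sample).
categorical_columns = ["gender", "work_type"]

# Fit one encoder per column and keep it, so codes can be mapped back later.
encoders = {}
for column in categorical_columns:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

# Target is the stroke column; everything else is a feature.
X = df.drop(columns=["stroke"])
y = df["stroke"]

print(X)
```

`LabelEncoder` assigns integer codes in sorted order of the labels. The scikit-learn docs describe it as meant for target labels, so `OrdinalEncoder` may be worth a look for feature columns.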
Split Data
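The data is split using train_test_split from scikit-learn's model_selection module, with the test size set to 0.6 and the random state set to 1. A test size of 0.6 holds out 60% of the rows for testing and leaves 40% for training.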
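A minimal sketch of the split with the same `test_size` and `random_state`, run on placeholder arrays rather than the stroke data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels: 10 rows, 2 feature columns.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 60% of the rows for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=1
)

print(len(X_train), "training rows,", len(X_test), "test rows")
```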
Testing and making Predictions
After the data is split, I run the naive Bayes model on it: I train the model on the training data with the fit() function and then make predictions on the test data.
The model scored 88%, which is quite accurate. Model tuning could improve it, but I'm yet to explore that 😁
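The fit, predict, and score steps might look like the sketch below. The post doesn't say which naive Bayes variant was used, so `GaussianNB` is an assumption here, and the data is synthetic, so the accuracy it prints has nothing to do with the 88% above.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data standing in for the encoded stroke features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=1
)

# GaussianNB is an assumed choice of naive Bayes variant.
model = GaussianNB()
model.fit(X_train, y_train)

# Predict on the held-out rows and compare against the true labels.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"accuracy: {accuracy:.2f}")
```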
I save the model results in an Excel file for further analysis. In real life, model results can be saved in your database as well.
Thanks for reading, guys 😁 See you some other time with more pieces.