Introduction
Sleep plays a vital role in maintaining our overall health and well-being. Understanding the factors that influence sleep health and lifestyle can provide valuable insights into optimizing our sleep patterns. In this article, we delve into a comprehensive analysis of a Sleep Health and Lifestyle dataset, uncovering intriguing relationships between sleep duration, physical activity, stress levels, and sleep quality. This step-by-step guide will shed light on the findings and their implications for improving sleep habits.
Table of Contents
- Introduction
- Methodology
- Exploratory Data Analysis
- Gender Differences in Sleep Duration
- Occupation and Sleep Duration
- Occupation and Sleep Quality
- Machine Learning Model for Sleep Disorder Prediction
- Conclusion
Methodology
To perform this analysis, we used the Sleep Health and Lifestyle dataset, which contains detailed information on sleep habits, demographics, occupation, physical activity, stress levels, and symptoms of sleep disturbances. We obtained the dataset from Kaggle (https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset). The dataset is pre-processed to handle missing values and ensure data integrity. We then apply statistical techniques and machine learning algorithms to gain meaningful insights and build predictive models for classifying sleep disorders.
# download the dataset from Kaggle
!pip install kaggle
!kaggle datasets download -d uom190346a/sleep-health-and-lifestyle-dataset
# extract the data from the zip archive
import zipfile
zip_path = 'sleep-health-and-lifestyle-dataset.zip'
output_path = 'sleep/'
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(output_path)
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
# read the dataset
df = pd.read_csv('sleep/Sleep_health_and_lifestyle_dataset.csv')
df.head()
# detect missing values
df.isna().sum()
As the output shows, the “Sleep Disorder” column has 219 missing values. After checking, I found that these are not truly missing: they are the rows for people who have no sleep disorder.
So I replaced NaN with “No Sleep Disorder”:
df['Sleep Disorder'] = df['Sleep Disorder'].fillna('No Sleep Disorder')
df.isna().sum()
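The effect of this replacement can be checked on a toy frame. This is a minimal sketch, assuming a hypothetical four-row DataFrame (`toy`) rather than the real dataset; it shows that `fillna` touches only the missing entries and leaves existing labels alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real 'Sleep Disorder' column.
toy = pd.DataFrame({
    "Sleep Disorder": ["Insomnia", np.nan, "Sleep Apnea", np.nan],
})

missing_before = int(toy["Sleep Disorder"].isna().sum())

# Replace only the missing entries; existing labels are left unchanged.
toy["Sleep Disorder"] = toy["Sleep Disorder"].fillna("No Sleep Disorder")

missing_after = int(toy["Sleep Disorder"].isna().sum())
counts = toy["Sleep Disorder"].value_counts()

print("missing before:", missing_before, "missing after:", missing_after)
print(counts)
```

Counting the labels afterwards is a quick way to confirm that the new category holds exactly as many rows as were missing before.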
Exploratory Data Analysis
In this section, I explore the relationships between sleep duration, physical activity, stress levels, and sleep quality, using charts to highlight the key findings.
1. The relationship between sleep duration and sleep quality
korelasi = df['Sleep Duration'].corr(df['Quality of Sleep'])
print('relationship between sleep duration and sleep quality: ', korelasi)
plt.scatter(df['Sleep Duration'], df['Quality of Sleep'])
plt.xlabel('Sleep duration')
plt.ylabel('Quality of sleep')
plt.title('relationship between sleep duration and sleep quality')
plt.show()
relationship between sleep duration and sleep quality: 0.8832130004106177
First, I looked at the correlation between sleep duration and sleep quality. The coefficient is 0.88, a strong positive correlation, and the scatter plot shows the same pattern: people who sleep longer tend to report better sleep quality.
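To make the coefficient less of a black box, here is a minimal sketch of what `Series.corr` computes by default (the Pearson coefficient). The values are hypothetical toy numbers, not rows from the dataset, and the names `duration`, `quality`, `r_manual`, and `r_pandas` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the dataset.
duration = pd.Series([6.0, 6.5, 7.0, 7.5, 8.0])
quality = pd.Series([5, 6, 7, 7, 9])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means.
dx = duration - duration.mean()
dy = quality - quality.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method unless told otherwise.
r_pandas = duration.corr(quality)

print("manual:", r_manual)
print("pandas:", r_pandas)
```

A value close to 1 means the two columns rise together almost linearly; it does not by itself say that longer sleep causes better quality.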
2. The Relationship between Physical Activity and Sleep Quality
sns.boxplot(x='Physical Activity Level', y='Quality of Sleep', data=df)
plt.xlabel('Physical activity level')
plt.ylabel('Quality of sleep')
plt.title('The Relationship between Physical Activity and Sleep Quality')
plt.show()
Here the picture is less clear-cut: people with a physical activity level of 30 have good sleep quality, but so do people with a physical activity level of 75.
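Because the box plot draws one box per distinct activity value, it can help to band the values and compare medians. The following is only a sketch under stated assumptions: the rows are hypothetical toy data, and the cut points (40 and 70) are arbitrary choices for illustration, not thresholds taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the column names mirror the dataset's.
toy = pd.DataFrame({
    "Physical Activity Level": [30, 30, 45, 60, 75, 75, 90],
    "Quality of Sleep": [7, 8, 6, 7, 8, 8, 9],
})

# Group activity levels into three illustrative bands (cut points are assumptions).
bands = pd.cut(
    toy["Physical Activity Level"],
    bins=[0, 40, 70, 100],
    labels=["low", "medium", "high"],
)

# Count and median sleep quality per band.
summary = toy.groupby(bands, observed=True)["Quality of Sleep"].agg(["count", "median"])
print(summary)
```

The `count` column matters as much as the median: a band with only a handful of rows should be read with caution.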
3. The relationship between stress levels and sleep quality
stres = df['Stress Level']
kualitas_tdr = df['Quality of Sleep']
kor = stres.corr(kualitas_tdr)
print(kor)
plt.scatter(stres, kualitas_tdr)
plt.xlabel('Stress Level')
plt.ylabel('Quality of Sleep')
plt.title('The relationship between stress levels and sleep quality')
plt.show()
-0.8987520310040427
The correlation is about -0.90, a strong negative relationship: people with the highest stress levels have very low sleep quality.
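Stress level and sleep quality are both rating-style scores, so a rank-based coefficient is a reasonable cross-check on the Pearson value. This sketch uses `scipy.stats.spearmanr` on hypothetical toy ratings (not dataset rows); it is an additional check, not the method used in the analysis above.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy ratings, not taken from the dataset.
stress = pd.Series([3, 4, 5, 6, 7, 8])
quality = pd.Series([9, 8, 8, 7, 6, 4])

# Pearson measures linear association on the raw values.
pearson_r = stress.corr(quality)

# Spearman measures monotonic association on the ranks.
rho, p_value = stats.spearmanr(stress, quality)

print("Pearson r:", pearson_r)
print("Spearman rho:", rho, "p-value:", p_value)
```

If the two coefficients agree in sign and rough size, the conclusion does not hinge on treating the ratings as continuous measurements.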
Gender Differences in Sleep Duration
sleep_duration=df['Sleep Duration'].mean()
age = df['Age'].mean()
print('average sleep duration')
print(sleep_duration)
print('age average')
print(age)
average sleep duration
7.132085561497325
age average
42.18449197860963
Across the whole dataset, the average sleep duration is 7.13 hours and the average age is 42.18 years.
# split the data by gender and compare mean sleep duration with a two-sample t-test
df_male = df[df['Gender'] == 'Male']
df_female = df[df['Gender'] == 'Female']
mean_dur_male = df_male['Sleep Duration'].mean()
mean_dur_female = df_female['Sleep Duration'].mean()
t_statistic, p_value = stats.ttest_ind(df_male['Sleep Duration'], df_female['Sleep Duration'])
print("men's average sleep duration: ",mean_dur_male)
print("women's average sleep duration: ", mean_dur_female)
print("statistic: ", t_statistic)
print("p_value: ", p_value)
Analyzing sleep duration across genders revealed a small gap. On average, males slept 7.04 hours, while females slept slightly longer at 7.23 hours. The p-value printed by the t-test indicates whether a gap of this size is statistically significant or could plausibly be due to chance.
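By default `stats.ttest_ind` assumes both groups have equal variance. When that is in doubt, Welch's variant (`equal_var=False`) is a common alternative. The sketch below runs both on synthetic samples; the group sizes, means, and spreads are made-up assumptions for illustration and do not reproduce the dataset's results.

```python
import numpy as np
from scipy import stats

# Synthetic samples with assumed means and spreads (not the real data).
rng = np.random.default_rng(0)
male = rng.normal(loc=7.0, scale=0.7, size=40)
female = rng.normal(loc=7.2, scale=0.9, size=60)

# Student's t-test: assumes equal variances in both groups.
t_student, p_student = stats.ttest_ind(male, female)

# Welch's t-test: does not assume equal variances.
t_welch, p_welch = stats.ttest_ind(male, female, equal_var=False)

alpha = 0.05  # conventional significance threshold
print("Student:", t_student, p_student)
print("Welch:  ", t_welch, p_welch)
print("significant at 5% (Welch):", p_welch < alpha)
```

A p-value below the chosen threshold suggests the difference in means is unlikely to be chance alone; it says nothing about how large or practically important the difference is.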
Occupation and Sleep Duration
I examined the relationship between occupation and sleep duration. Surprisingly, professionals such as software engineers and doctors exhibited shorter sleep durations compared to other occupations. We discuss the possible implications and factors influencing sleep duration in different professions.
plt.figure(figsize=(10, 5))
sns.boxplot(x='Occupation', y='Sleep Duration', data=df)
plt.title('Comparison of sleep duration with work')
plt.xlabel('job')
plt.ylabel('duration')
plt.xticks(rotation=45)
plt.show()
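A numeric summary next to the box plot makes it easier to see how many people stand behind each box. This is a minimal sketch on hypothetical toy rows (the occupations and durations are invented for illustration); the column names mirror the dataset's.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "Occupation": ["Doctor", "Doctor", "Nurse", "Nurse", "Nurse", "Engineer"],
    "Sleep Duration": [6.1, 6.3, 6.5, 7.0, 7.5, 8.0],
})

# Group size, mean, and median sleep duration per occupation, shortest sleepers first.
summary = (
    toy.groupby("Occupation")["Sleep Duration"]
    .agg(["count", "mean", "median"])
    .sort_values("median")
)
print(summary)
```

Occupations with very few rows can look extreme purely by chance, so the `count` column is worth checking before comparing professions.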
Occupation and Sleep Quality
plt.figure(figsize=(10, 5))
sns.boxplot(x='Occupation', y='Quality of Sleep', data=df)
plt.title('comparison of sleep quality with work')
plt.xlabel('job')
plt.ylabel('quality')
plt.xticks(rotation=45)
plt.show()
Exploring the impact of occupation on sleep quality, we found that nurses and engineers reported better sleep quality compared to other occupations. We delve into the potential reasons behind these differences and their implications for occupational health.
Machine Learning Model for Sleep Disorder Prediction
To predict sleep disorder symptoms, we employed machine learning algorithms, including Logistic Regression, Decision Tree, Random Forest, and SVM. We discuss the model selection process, hyperparameter tuning, and evaluation using performance metrics such as accuracy, precision, recall, and F1-score.
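One preparatory detail: scikit-learn's feature selectors and estimators expect numeric input, while a table like this one contains text columns such as Gender and Occupation. The sketch below shows one possible way to encode them. It assumes a hypothetical toy frame, and it assumes blood pressure is stored as `systolic/diastolic` text; the names `toy`, `bp`, and `X_numeric` are illustrative, and this is not necessarily the encoding used in the original notebook.

```python
import pandas as pd

# Hypothetical toy rows; the 'systolic/diastolic' text format is an assumption.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Occupation": ["Doctor", "Nurse", "Doctor"],
    "Blood Pressure": ["126/83", "120/80", "140/90"],
    "Age": [27, 35, 50],
})

# Split the blood pressure text into two integer columns.
bp = toy["Blood Pressure"].str.split("/", expand=True).astype(int)
toy["Systolic"] = bp[0]
toy["Diastolic"] = bp[1]
toy = toy.drop(columns="Blood Pressure")

# One-hot encode the remaining text columns.
X_numeric = pd.get_dummies(toy, columns=["Gender", "Occupation"], dtype=int)
print(X_numeric)
```

One-hot encoding avoids implying an order between categories, at the cost of one extra column per category.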
First, I compare the candidate algorithms to find the best one for this prediction.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif, RFE
# Separate features and target
# (X must contain only numeric columns here; text columns need encoding first)
X = df.drop('Sleep Disorder', axis=1)
y = df['Sleep Disorder']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature selection with SelectKBest
selector = SelectKBest(score_func=f_classif, k=5)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
# Logistic Regression model
lr_model = LogisticRegression()
lr_model.fit(X_train_selected, y_train)
lr_accuracy = lr_model.score(X_test_selected, y_test)
# Feature selection with Recursive Feature Elimination (RFE)
estimator = DecisionTreeClassifier()
rfe_selector = RFE(estimator, n_features_to_select=5)
X_train_selected_rfe = rfe_selector.fit_transform(X_train, y_train)
X_test_selected_rfe = rfe_selector.transform(X_test)
# Decision Tree model
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train_selected_rfe, y_train)
dt_accuracy = dt_model.score(X_test_selected_rfe, y_test)
# Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train_selected_rfe, y_train)
rf_accuracy = rf_model.score(X_test_selected_rfe, y_test)
# Support Vector Machine (SVM) model
svm_model = SVC()
svm_model.fit(X_train_selected_rfe, y_train)
svm_accuracy = svm_model.score(X_test_selected_rfe, y_test)
# Print model accuracy
print("Logistic Regression Accuracy:", lr_accuracy)
print("Decision Tree Accuracy:", dt_accuracy)
print("Random Forest Accuracy:", rf_accuracy)
print("SVM Accuracy:", svm_accuracy)
Logistic Regression Accuracy: 0.88
Decision Tree Accuracy: 0.9066666666666666
Random Forest Accuracy: 0.88
SVM Accuracy: 0.88
Based on the accuracy results, the Decision Tree has the highest accuracy of the models tested, at about 0.907. However, accuracy alone may not be enough to judge overall model quality, so it is worth also looking at precision, recall, and F1-score.
from sklearn.metrics import precision_score, recall_score, f1_score
# Predictions with the Logistic Regression model
lr_predictions = lr_model.predict(X_test_selected)
lr_precision = precision_score(y_test, lr_predictions, average='weighted')
lr_recall = recall_score(y_test, lr_predictions, average='weighted')
lr_f1 = f1_score(y_test, lr_predictions, average='weighted')
# Predictions with the Decision Tree model
dt_predictions = dt_model.predict(X_test_selected_rfe)
dt_precision = precision_score(y_test, dt_predictions, average='weighted')
dt_recall = recall_score(y_test, dt_predictions, average='weighted')
dt_f1 = f1_score(y_test, dt_predictions, average='weighted')
# Predictions with the Random Forest model
rf_predictions = rf_model.predict(X_test_selected_rfe)
rf_precision = precision_score(y_test, rf_predictions, average='weighted')
rf_recall = recall_score(y_test, rf_predictions, average='weighted')
rf_f1 = f1_score(y_test, rf_predictions, average='weighted')
# Predictions with the SVM model
svm_predictions = svm_model.predict(X_test_selected_rfe)
svm_precision = precision_score(y_test, svm_predictions, average='weighted')
svm_recall = recall_score(y_test, svm_predictions, average='weighted')
svm_f1 = f1_score(y_test, svm_predictions, average='weighted')
# Print evaluation metrics
print("Logistic Regression Precision:", lr_precision)
print("Logistic Regression Recall:", lr_recall)
print("Logistic Regression F1-score:", lr_f1)
print('--------------------------------------------')
print("Decision Tree Precision:", dt_precision)
print("Decision Tree Recall:", dt_recall)
print("Decision Tree F1-score:", dt_f1)
print('--------------------------------------------')
print("Random Forest Precision:", rf_precision)
print("Random Forest Recall:", rf_recall)
print("Random Forest F1-score:", rf_f1)
print('--------------------------------------------')
print("SVM Precision:", svm_precision)
print("SVM Recall:", svm_recall)
print("SVM F1-score:", svm_f1)
Logistic Regression Precision: 0.8860818713450292
Logistic Regression Recall: 0.88
Logistic Regression F1-score: 0.8809411764705881
--------------------------------------------
Decision Tree Precision: 0.9054949494949495
Decision Tree Recall: 0.9066666666666666
Decision Tree F1-score: 0.9058212829069338
--------------------------------------------
Random Forest Precision: 0.8818596218596219
Random Forest Recall: 0.88
Random Forest F1-score: 0.8785395537525356
--------------------------------------------
SVM Precision: 0.8905454545454546
SVM Recall: 0.88
SVM F1-score: 0.8775138356747552
The evaluation metrics, rounded to three decimals, compare as follows:
- Logistic Regression has a precision of 0.886, a recall of 0.880, and an F1-score of 0.881.
- The Decision Tree has a precision of 0.905, a recall of 0.907, and an F1-score of 0.906.
- Random Forest has a precision of 0.882, a recall of 0.880, and an F1-score of 0.879.
- SVM has a precision of 0.891, a recall of 0.880, and an F1-score of 0.878.
Based on these metrics, the Decision Tree model performs slightly better, with higher precision, recall, and F1-score than the other models.
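Weighted averages can hide weak performance on the smaller classes, so a per-class view is a useful complement. The sketch below uses scikit-learn's `confusion_matrix` and `classification_report` on hypothetical toy labels; the label lists are invented for illustration and are not the models' actual predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical class names and toy labels, for illustration only.
labels = ["Insomnia", "No Sleep Disorder", "Sleep Apnea"]
y_true = ["No Sleep Disorder"] * 6 + ["Insomnia"] * 2 + ["Sleep Apnea"] * 2
y_pred = (
    ["No Sleep Disorder"] * 6
    + ["Insomnia", "No Sleep Disorder"]
    + ["Sleep Apnea", "Insomnia"]
)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Precision, recall, and F1 reported separately for each class.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)

print(cm)
print(report)
```

In this toy case the majority class is predicted perfectly while each minority class is right only half the time, which is exactly the kind of gap a single weighted score can mask.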
Using the Decision Tree, I then tried an optimization method: hyperparameter tuning.
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
model = DecisionTreeClassifier()
# grid of Decision Tree hyperparameters to search
parameters = {'max_depth': [5, 10, 15], 'min_samples_split': [2, 5, 10]}
grid_search = GridSearchCV(model, parameters, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
ensemble_model = RandomForestClassifier(n_estimators=100)
ensemble_model.fit(X_train, y_train)
# note: this scores the untuned `model`; pass best_model instead to score the tuned tree
cross_val_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Cross-Validation Scores:", cross_val_scores)
Cross-Validation Scores: [0.78333333 0.85 0.93333333 0.93333333 0.89830508]
These accuracy scores show the performance of the Decision Tree model on each of the five cross-validation folds. They range from about 0.78 to 0.93, and this spread helps evaluate the stability and generalizability of the model.
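To see what the grid search actually selected, one can inspect `best_params_` and cross-validate the tuned estimator. This is a minimal sketch on synthetic data from `make_classification` (an assumption made so the example is self-contained); the names `X_demo`, `y_demo`, `search`, and `tuned_scores` are illustrative and the numbers it prints are unrelated to the sleep dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the example runs on its own.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=42
)

# Same hyperparameter grid as in the article.
param_grid = {"max_depth": [5, 10, 15], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_demo, y_demo)

# Cross-validate the tuned estimator rather than an untuned one.
tuned_scores = cross_val_score(search.best_estimator_, X_demo, y_demo, cv=5)

print("best parameters:", search.best_params_)
print("mean accuracy:", tuned_scores.mean(), "std:", tuned_scores.std())
```

Reporting the mean together with the standard deviation summarizes both the level and the stability of the scores. Note that scoring on the same data used for tuning tends to be optimistic, so a held-out test set remains the fairer check.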
Conclusion
In this comprehensive analysis of sleep health and lifestyle, we uncovered intriguing relationships between sleep duration, physical activity, stress levels, and sleep quality. The findings shed light on the importance of maintaining adequate sleep duration, managing stress levels, and engaging in regular physical activity for optimal sleep quality. Furthermore, the machine learning model developed for sleep disorder prediction showcased promising accuracy and performance metrics, paving the way for potential applications in sleep disorder screening.
By understanding the interplay between various factors influencing sleep health, individuals can make informed decisions to improve their sleep habits and overall well-being.
To see the full code, visit https://github.com/BerilCa/Machine_Learning/tree/main/Sleep%20Health%20and%20Lifestyle%20Dataset