Basic Machine Learning with Python (MySkill Data Science Project) | by Arif Kurniawan

Machine learning is a subfield of artificial intelligence that deals with building algorithms and models that can learn from data and make predictions or decisions based on that learning.

At a high level, the process of machine learning can be broken down into three main steps:

Data Collection and Preparation: The first step is to collect and clean the data that you will use to train your model. This step is crucial as the quality of your model will depend on the quality of your data.
Model Training: Once the data is prepared, the next step is to train a machine learning model using this data. During this step, the model is exposed to the training data, and it learns the relationships between the input features and the target variable.
Model Evaluation and Deployment: After the model is trained, the next step is to evaluate its performance on a separate test dataset. Based on the evaluation results, the model may need to be refined or a different model may need to be selected. Once the final model is selected, it can then be deployed for making predictions in real-world scenarios.
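The three steps above can be sketched end to end on synthetic data. This is only a minimal illustration, not the project's code: make_classification generates a hypothetical dataset, and RandomForestClassifier is used simply because it is the model chosen later in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: data collection and preparation
# (synthetic data stands in for a real, cleaned dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: model training on the training portion only
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Step 3: evaluation on the held-out test portion
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Deployment is left out here, since it depends entirely on where the predictions are needed.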

There are many different types of machine learning algorithms, including supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, to name a few. Each type of algorithm is designed to handle a different type of problem.

This article walks through a supervised machine learning project in Python. I run the code in Google Colab.

First, I need to import the required libraries, like this:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

Because I work in Google Colab and my dataset is stored in Google Drive, I first need to mount Google Drive in Colab.

from google.colab import drive
drive.mount('/content/drive')

After mounting Google Drive, I load the dataset ‘mushrooms.csv’ into a pandas DataFrame so I can inspect it.

df = pd.read_csv('/content/drive/MyDrive/dataset/mushrooms.csv')

Next, I inspect every column with df.info(). Its output shows the data type of each column and the number of non-null entries, so I can see whether any column contains null values.
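As a small sketch of this inspection step, the snippet below runs the same checks on a hypothetical toy frame (the values are invented for illustration, not taken from mushrooms.csv). It also counts a placeholder such as "?", on the assumption that a dataset may mark missing entries with a string that df.info() would not report as null.

```python
import pandas as pd

# Hypothetical toy data; the real project reads mushrooms.csv instead
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["f", "n", None, "?"],
})

toy.info()                               # column names, non-null counts, dtypes
null_counts = toy.isnull().sum()         # true missing values per column
placeholder_counts = (toy == "?").sum()  # string placeholders per column

print(null_counts)
print(placeholder_counts)
print(toy.head())                        # first rows, to eyeball the data
```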

I also visualize every variable in the dataset by looping over the columns and drawing a seaborn countplot for each one, which shows how often each unique value occurs. matplotlib.pyplot displays each plot.

for i in df.columns:
    sns.countplot(data=df, x=i)
    plt.show()

To see the number of unique values in each column, I use the code below.

df.nunique()

The number of unique values in each column
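One practical use of these counts is spotting columns with a single unique value, which carry no information for a classifier. Here is a minimal sketch on a hypothetical toy frame; the column names and values are illustrative and are not read from mushrooms.csv.

```python
import pandas as pd

# Hypothetical toy data with one constant column
toy = pd.DataFrame({
    "cap-shape": ["x", "b", "x", "f"],
    "veil-type": ["p", "p", "p", "p"],
})

unique_counts = toy.nunique()
# Columns where every row holds the same value
constant_cols = unique_counts[unique_counts == 1].index.tolist()

print(unique_counts)
print(constant_cols)
```

Such columns could be dropped with toy.drop(columns=constant_cols) before training, although tree-based models simply ignore them.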

In the next step, I separate the independent variables from the dependent (target) variable. Here the ‘class’ column is the target, so I create a new variable ‘y’ for it and a variable ‘x’ that holds the remaining columns as independent variables:

x = df.drop('class', axis=1)
y = df['class']

scikit-learn models expect numeric input, so I need to encode the categorical data as numbers with LabelEncoder(). I encode both ‘x’ and ‘y’. Note that the single Encoder_X object is refit on every column in the loop, so afterwards it only remembers the mapping of the last column.

Encoder_X = LabelEncoder()
for col in x.columns:
    # refit the encoder on each column and replace its values with integer codes
    x[col] = Encoder_X.fit_transform(x[col])

Encoder_y = LabelEncoder()
y = Encoder_y.fit_transform(y)

y data after encoding
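To make the encoding less of a black box, the sketch below shows how to read the mapping back from a fitted LabelEncoder. It also shows OrdinalEncoder, which is a substitute I am naming here and not what the code above uses: it encodes all feature columns in one call and keeps a separate mapping per column. All values are illustrative, not rows from mushrooms.csv.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative values only
labels = ["p", "e", "e", "p"]
features = pd.DataFrame({"odor": ["n", "f", "n"], "cap-color": ["w", "n", "w"]})

# Target: LabelEncoder assigns integer codes in sorted order of the labels
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)
print(label_encoder.classes_)   # position in this array = integer code
print(encoded_labels)
print(label_encoder.inverse_transform(encoded_labels))  # back to the original labels

# Features: OrdinalEncoder handles every column and stores one mapping per column
feature_encoder = OrdinalEncoder()
encoded_features = feature_encoder.fit_transform(features)
print(feature_encoder.categories_)
print(encoded_features)
```

Keeping the fitted encoders around is what makes it possible to translate predictions back into the original class labels later.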

Before I fit a machine learning model, I need to split the data into training and test sets with a ratio of 8:2.

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
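To see what the 8:2 split does, here is a small sketch on hypothetical arrays. The stratify argument is an addition of mine that the call above does not use; it asks train_test_split to keep the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features
y_demo = np.array([0, 1] * 5)          # balanced binary labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows
print(y_te)                    # one sample of each class, because of stratify
```

Fixing random_state makes the split reproducible, so repeated runs evaluate the model on the same test rows.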

After splitting the data, I can start building the model. In this project, I use a Random Forest for this classification task. I import RandomForestClassifier, instantiate it as rf, and fit it on X_train and y_train.

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(oob_score=True)
rf.fit(X_train, y_train)
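Setting oob_score=True asks the forest to estimate its accuracy from the samples each tree did not see during bootstrapping. After fitting, that estimate is available as the oob_score_ attribute. The sketch below reads it on synthetic data from make_classification, which is an assumption for illustration and not the mushroom data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded training data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)

# Accuracy estimated on out-of-bag samples, no separate test set needed
print(f"OOB accuracy estimate: {forest.oob_score_:.3f}")

# Relative importance of each feature; the values sum to 1
print(forest.feature_importances_)
```

In the project itself, the same attributes would be read from rf once rf.fit has run.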

With the model trained, I use it to predict the classes for X_test.

y_pred = rf.predict(X_test)

After training the model and predicting with it, I need to evaluate the results. I use two tools for that: the confusion matrix and the classification report. For the confusion matrix, I draw a heatmap to make the prediction results easier to read.

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d');  # fmt='d' prints the counts as plain integers

According to the confusion matrix, the Random Forest produces no false positives and no false negatives; every prediction is either a true positive or a true negative. In other words, the model reaches perfect accuracy on the test set.
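For reference, this is how scikit-learn lays out a binary confusion matrix: rows are the true classes, columns are the predicted classes, and class 1 is treated as the positive class here. The labels below are made up so that every cell is non-zero.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)
# For binary labels, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

A perfect classifier, like the one reported above, has zeros in the two off-diagonal cells (FP and FN).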

Besides the confusion matrix, I can evaluate the model with a classification report, which shows the precision, recall, and F1-score for each class.

print(classification_report(y_test, y_pred))

The classification report shows a precision of 1.00 for both classes, meaning that every instance predicted as a given class actually belongs to that class. The recall for both classes is 1.00, meaning that the classifier found every instance of each class. The F1-score, the harmonic mean of precision and recall, is therefore also 1.00 for both classes.
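The three metrics follow directly from the confusion matrix counts. The helper below is a hypothetical function written for illustration, with made-up counts, and it assumes the denominators are non-zero.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 for one class from raw counts.

    Assumes tp + fp > 0 and tp + fn > 0 (no division-by-zero handling).
    """
    precision = tp / (tp + fp)  # share of positive predictions that are right
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical imperfect classifier: 8 hits, 2 false alarms, 1 miss
p, r, f = precision_recall_f1(tp=8, fp=2, fn=1)
print(f"precision={p:.3f}, recall={r:.3f}, f1={f:.3f}")

# With no false positives and no false negatives, all three metrics are 1.0
perfect = precision_recall_f1(tp=10, fp=0, fn=0)
print(perfect)
```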

Overall, the classification report shows an accuracy of 100% (1.00), with perfect precision, recall, and F1-score for both classes. This is excellent performance: the classifier correctly predicts the class label of every instance in the test set.
