![](https://crypto4nerd.com/wp-content/uploads/2023/08/0CWVIIJrWxRdY8Pwr-1024x683.jpeg)
Why do we need the ROC CURVE?
"THRESHOLD SELECTION" — this is the main reason we need the ROC CURVE.
It is closely tied to classification, specifically "BINARY CLASSIFICATION".
Threshold Selection:
- Ex: Suppose we have college data with the features "IQ", "CGPA" and "Placement", where the result is binary: a student either gets placed ("YES") or does not ("NO").
- We divide the data into two parts, a "training set" and a "test set", train the model on the training data, and test it on the test data.
- On the test data the model does not directly give results as "0" and "1". The model gives results as a "probability", a number that tells how likely the student is to be placed. Ex: one student's prediction is 0.45, i.e. 45%, and we get such probability-based results for all students.
- We then have to convert this "probability result" into the class labels "0" and "1" by deciding on a "THRESHOLD".
- We have to choose the threshold value so that it splits the results into the two class labels; when we decide the threshold manually by intuition, it may or may not agree well with the actual outcomes.
- Ex: We decide Threshold = 0.5. Based on this, a student whose probability is above the threshold is predicted as placed, and below the threshold as not placed.
BUT A THRESHOLD OF 0.5 WILL NOT ALWAYS WORK.
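As a quick sketch, the conversion from probabilities to class labels is a single comparison (the probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical predicted placement probabilities for five students
probs = np.array([0.45, 0.80, 0.30, 0.65, 0.50])

# Applying a threshold of 0.5 turns each probability into a class label
threshold = 0.5
labels = (probs >= threshold).astype(int)
print(labels)  # [0 1 0 1 1]
```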
- Ex: EMAIL CLASSIFICATION (a model trained on a lot of emails to identify "Spam" vs "Not Spam").
There are two mistakes the model can make:
EMAIL NOT SPAM — PREDICTED SPAM
EMAIL SPAM — PREDICTED NOT SPAM
The two mistakes do not always have the same importance; it varies from case to case.
- Suppose I get an email about an interview and my model puts it in spam, even though it is not actually spam. The model has made a blunder, and since I rely on the model, I will raise the threshold to "0.75" so it is more conservative about flagging spam, which helps reduce this mistake.
- This is the power of the "threshold": based on the model's behaviour we can increase or decrease the threshold value.
How much should the "threshold" be? We can decide that with the "ROC CURVE".
It is like a "report card" for "Binary Classification": in one glance we can understand how our model performs.
True Positive (TP): correctly predicting a label (we predicted "yes", and it's "yes").
True Negative (TN): correctly predicting the other label (we predicted "no", and it's "no").
False Positive (FP): falsely predicting a label (we predicted "yes", but it's "no").
False Negative (FN): missing an incoming label (we predicted "no", but it's "yes").
True Positive Rate —
TPR = TP / (TP + FN)
It gives an intuition of benefit: how much benefit we can get from the system.
- Ex: Creating a Netflix churn-rate prediction model to find user patterns.
- "1" — will leave the platform, "0" — will not leave the platform.
- Suppose 100 customers actually want to leave Netflix and my model detects only 80 of them, so my "TPR" is 80%. We always want to maximize "TPR", as a higher TPR means the model solves the problem better. WHEN FALSE NEGATIVES ARE ZERO, THE TRUE POSITIVE RATE IS 100%.
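The arithmetic from the Netflix example, as a sketch:

```python
# Netflix churn sketch: 100 customers actually leave, the model catches 80
tp = 80            # churners the model correctly flagged
fn = 100 - tp      # churners the model missed

tpr = tp / (tp + fn)
print(tpr)  # 0.8, i.e. an 80% true positive rate
```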
False Positive Rate —
FPR = FP / (FP + TN)
"TREAT IT AS COST": how expensive will the model be? We build a model to get a solution, and when it is not 100% accurate, the errors are a cost we have to bear. In the churn example, suppose the model says certain people will leave the platform, so we give them benefits to hold on to them, but in reality they never intended to leave; we then incur the retention cost for nothing.
- Ex: Email spam — out of all the emails that are not spam, how many does our model say are spam?
- Ex: Netflix churn — out of all the people who are not leaving the platform, how many does the model say will leave?
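A matching cost sketch with hypothetical counts (the 30 and 170 below are made up):

```python
# Hypothetical counts: 200 customers stay, but the model wrongly flags 30
fp = 30     # loyal customers predicted to churn (needless retention spend)
tn = 170    # loyal customers correctly left alone

fpr = fp / (fp + tn)
print(fpr)  # 0.15, i.e. 15% of loyal customers incur unnecessary cost
```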
ROC CURVE —
(RECEIVER OPERATING CHARACTERISTIC)
- (BENEFIT & COST MODEL) FPR goes on the x-axis and TPR on the y-axis; the line drawn through all the (FPR, TPR) points is called the "ROC CURVE".
- The graph always lies between 0 and 1, as the TPR and FPR values can each only be between 0 and 1.
EXPLANATION:
- Suppose we have student data with a few features and need to predict student placement.
- Since we need a binary prediction, we decide to use a "Logistic Regression" model. We divide the data into a "training set" and a "test set", train the model on the training set, and check the result on the test data, trying out different thresholds (0.3, 0.5, 0.6, 0.8).
- For every threshold value we get a "confusion matrix", and from every confusion matrix we can calculate "TPR & FPR". We plot the "TPR, FPR" values for all thresholds on a graph, which creates a "CURVE" called the "ROC CURVE", and by inspecting it we can decide which threshold is best to use: whichever threshold's point lies nearest the top-left corner (TPR near "1", FPR near 0) is the best threshold value.
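The steps above can be sketched directly: for each candidate threshold, build the predictions, count the confusion-matrix cells, and compute one (FPR, TPR) point. All data below is made up for illustration:

```python
import numpy as np

# Made-up placement probabilities and true outcomes for ten students
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.2, 0.4, 0.35, 0.8, 0.45, 0.9, 0.3, 0.6, 0.7, 0.55])

# One (FPR, TPR) point per candidate threshold; plotting them gives the ROC curve
for thr in [0.3, 0.5, 0.6, 0.8]:
    y_pred = (y_score >= thr).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    tn = ((y_pred == 0) & (y_true == 0)).sum()
    print(f"thr={thr:.1f}  TPR={tp / (tp + fn):.2f}  FPR={fp / (fp + tn):.2f}")
```

Lower thresholds push both rates up; higher thresholds push both down, which is exactly the trade-off the ROC curve visualizes.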
WHEN WE DECREASE THE THRESHOLD VERY LOW (0.1, CLOSE TO 0), FPR AND TPR BOTH INCREASE AND MODEL PERFORMANCE IS VERY BAD. NOW IF WE INCREASE THE THRESHOLD TO AROUND 0.99 (CLOSE TO 1), TPR and FPR both DECREASE, or we can say the point moves toward (0, 0) on the graph. THIS MEANS WE ARE MAKING VERY FEW POSITIVE PREDICTIONS.
- Ex: in email spam, with a threshold of 0.99 we predict very few emails as spam; only a probability above that value counts as spam.
- In this case, TP decreases, as we fail to predict actual spam emails as spam.
- And FN increases, as we do not call actual spam spam. So eventually TPR DECREASES.
Now if I reduce my threshold from 0.99 to 0.85, the model will predict a little more spam. True Positives (TP) will increase, so TPR will increase. But at this threshold FPR will not increase at the rate TPR increases; we can say FP (False Positives, mail which is "NOT SPAM" but the model predicts as "SPAM") will barely increase, since only emails with a very high spam probability are flagged. So the graph rises more in the "y direction" than in the "x direction", and the plot starts taking its "CURVY SHAPE".
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/master/diabetes.csv')
data.head()

X = data.drop('Outcome', axis=1)
y = data['Outcome']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_scores = model.predict_proba(X_test)[:,1]
y_scores
# For all test data, we generated the probability of each patient being diabetic.
# As we can see, the first patient's probability of being diabetic is 0.44, and so on.

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
# roc_curve generates the thresholds and the corresponding FPR and TPR values.

thresholds
# The candidate threshold values (note: roc_curve prepends one sentinel value above 1).
import plotly.graph_objects as go
import numpy as np

# Generate a trace for ROC curve
trace0 = go.Scatter(
    x=fpr,
    y=tpr,
    mode='lines',
    name='ROC curve'
)
# Only label every nth point to avoid cluttering
n = 10
indices = np.arange(len(thresholds)) % n == 0 # Choose indices where index mod n is 0
trace1 = go.Scatter(
    x=fpr[indices],
    y=tpr[indices],
    mode='markers+text',
    name='Threshold points',
    text=[f"Thr={thr:.2f}" for thr in thresholds[indices]],
    textposition='top center'
)
# Diagonal line
trace2 = go.Scatter(
    x=[0, 1],
    y=[0, 1],
    mode='lines',
    name='Random (Area = 0.5)',
    line=dict(dash='dash')
)
data = [trace0, trace1, trace2]
# Define layout with square aspect ratio
layout = go.Layout(
    title='Receiver Operating Characteristic',
    xaxis=dict(title='False Positive Rate'),
    yaxis=dict(title='True Positive Rate'),
    autosize=False,
    width=800,
    height=800,
    showlegend=False
)
# Define figure and add data
fig = go.Figure(data=data, layout=layout)
# Show figure
fig.show()
# Assume that fpr, tpr, thresholds have already been calculated
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
print("Optimal threshold is:", optimal_threshold)
# Calculating the best threshold: the ROC point nearest the top-left corner,
# found by maximizing TPR - FPR.
Optimal threshold is: 0.5503810234218872
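The `np.argmax(tpr - fpr)` line picks the threshold that maximizes Youden's J statistic (TPR minus FPR), i.e. the ROC point farthest above the random-guess diagonal. A toy sketch with made-up ROC points:

```python
import numpy as np

# Toy (FPR, TPR, threshold) triples, mimicking the output of roc_curve
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.7, 0.9, 1.0])
thresholds = np.array([0.9, 0.7, 0.5, 0.1])

# Youden's J statistic (TPR - FPR) is largest at the ROC point farthest
# above the random-guess diagonal
j = tpr - fpr
best = thresholds[np.argmax(j)]
print(best)  # 0.7 (J = 0.7 - 0.1 = 0.6)
```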
AUC-ROC
(AREA UNDER THE CURVE)
"By using AUC-ROC we can compare two models to find which one is the better classifier."
The AUC-ROC measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1). AUC provides an aggregate measure of performance across all possible classification thresholds.
- An AUC of 1 indicates that the model has perfect discrimination: it correctly classifies all positive and negative instances.
- An AUC of 0.5 suggests the model has no discrimination ability: it is as good as random guessing.
- An AUC of 0 indicates that the model is perfectly wrong: it classifies all positive instances as negative and all negative instances as positive.
In practice, AUC values usually fall between 0.5 (random) and 1 (perfect), with higher values indicating better classification performance.
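As a quick sketch of these interpretations, `roc_auc_score` on toy labels (the scores are chosen by hand):

```python
from sklearn.metrics import roc_auc_score

# Toy labels: the better the scores rank positives above negatives,
# the closer the AUC is to 1
y_true = [0, 0, 1, 1]
print(roc_auc_score(y_true, [0.1, 0.4, 0.35, 0.8]))  # 0.75 (one pair mis-ranked)
print(roc_auc_score(y_true, [0.1, 0.2, 0.7, 0.8]))   # 1.0 (perfect ranking)
```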
import numpy as np
import plotly.graph_objects as go
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.preprocessing import StandardScaler

# Assuming that X_train, X_test, y_train, y_test are already defined
# SVM requires feature scaling for better performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Logistic Regression model
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
lr_scores = lr_model.predict_proba(X_test)[:,1]
# SVM model
svm_model = SVC(probability=True)
svm_model.fit(X_train_scaled, y_train)
svm_scores = svm_model.predict_proba(X_test_scaled)[:,1]
# Generate ROC curve data for logistic regression model
lr_fpr, lr_tpr, lr_thresholds = roc_curve(y_test, lr_scores)
lr_auc = roc_auc_score(y_test, lr_scores)
# Generate ROC curve data for SVM model
svm_fpr, svm_tpr, svm_thresholds = roc_curve(y_test, svm_scores)
svm_auc = roc_auc_score(y_test, svm_scores)
# Generate a trace for the Logistic Regression ROC curve
trace0 = go.Scatter(
    x=lr_fpr,
    y=lr_tpr,
    mode='lines',
    name=f'Logistic Regression (Area = {lr_auc:.2f})'
)
# Generate a trace for the SVM ROC curve
trace1 = go.Scatter(
    x=svm_fpr,
    y=svm_tpr,
    mode='lines',
    name=f'SVM (Area = {svm_auc:.2f})'
)
# Diagonal line
trace2 = go.Scatter(
    x=[0, 1],
    y=[0, 1],
    mode='lines',
    name='Random (Area = 0.5)',
    line=dict(dash='dash')
)
data = [trace0, trace1, trace2]
# Define layout with square aspect ratio
layout = go.Layout(
    title='Receiver Operating Characteristic',
    xaxis=dict(title='False Positive Rate'),
    yaxis=dict(title='True Positive Rate'),
    autosize=False,
    width=800,
    height=800,
    showlegend=True
)
# Define figure and add data
fig = go.Figure(data=data, layout=layout)
# Show figure
fig.show()