![](https://crypto4nerd.com/wp-content/uploads/2023/07/1SMnsQaosWHxCbwdakXtcxg-1024x155.png)
Now let’s visualize some of our numerical features and their relationship to the booking_status variable, to see whether there is any direct relationship between the two.
px.histogram(train_data, x='no_of_special_requests', color='booking_status', title='Number of special requests, and cancelled or not')
px.histogram(train_data, x='lead_time', color='booking_status', title='No. of days that elapsed between the entering date of the booking and the arrival date')
px.histogram(train_data, x='no_of_week_nights', color='booking_status', title='Number of week nights, and cancelled or not')
px.histogram(train_data, x='arrival_month', color='booking_status', title='Arrival Month')
px.histogram(train_data, x='avg_price_per_room', color='booking_status', title='Price per room, and cancelled or not')
Conclusions from the plots above:
- We can observe that as lead_time (the number of days between the booking date and the arrival date) increases, the number of canceled bookings also increases.
- Clients who were more invested in their rooms, for example by making special requests, tend not to cancel: the number of canceled bookings drops significantly as the number of special requests grows. This could, however, also be an effect of the imbalance in the dataset.
- Aside from lead_time, we can’t observe any significant difference between canceled and non-canceled bookings in the features above.
Visualizing Numerical features vs Booking Status
Another way of visualizing the relationship between our numerical data and the target column booking status is using boxplots. It’s important to understand boxplots before continuing further. For more info on boxplots: here.
# Assuming 'df' is your DataFrame and 'booking_status' is your target variable
# numerical_features = df.select_dtypes(include=['int64', 'float64']).columns
for feature in cols_num:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x='booking_status', y=feature, data=train_data)
    plt.title(f'Boxplot of {feature} vs Booking Status')
    plt.show()
Conclusions from the plots:
From the plots above, we can conclude that features like lead_time and no_of_weekend_nights seem to correlate with booking_status: the greater the time between the booking date and the arrival date, the more likely a reservation is to be canceled.
px.histogram(train_data, x='market_segment_type', color='booking_status', title='Market segment designation')
px.histogram(train_data, x='room_type_reserved', color='booking_status', title='Room Type reserved')
px.histogram(train_data, x='type_of_meal_plan', color='booking_status', title='Meal plan type')
We can use the code above to visualize the relationships between categorical features and the target variable booking status. Based on these plots we can’t observe any significant relationship. Plots below:
Correlation Matrix
correlation_matrix = train_data.corr()

plt.figure(figsize=(20, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation heatmap")
plt.show()
As we can see from the correlation matrix, one of the most important features is lead_time. Also, features with negative correlation coefficients like arrival_month, arrival_date, repeated_guest, etc. will be removed from our analysis.
From the plot above we can also observe that different months have different numbers of reservations and different cancellation rates.
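As a rough illustration of that screening step, the sketch below (an assumption of mine, not part of the original pipeline, and assuming booking_status is already encoded as 0/1) ranks the numeric features by their correlation with the target and lists the negatively correlated candidates mentioned above:

# Minimal sketch: correlation of each numeric feature with the target
numeric_cols = train_data.select_dtypes(include='number')
corr_with_target = numeric_cols.corr()['booking_status'].drop('booking_status')
print(corr_with_target.sort_values())

# Features with negative coefficients (e.g. arrival_month, repeated_guest)
# are the candidates this analysis removes.
negatively_correlated = corr_with_target[corr_with_target < 0].index.tolist()
print(negatively_correlated)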
Handling Missing Values
Handling missing values is a very important step in data preprocessing. The strategy for handling them depends on the type of data and the nature of the problem.
For numerical data, it’s common to drop rows with missing values, or to impute them with the mean, median, a random observed value, or a KNN-based estimate. For categorical data, it’s common to use the following techniques:
- Drop: Remove rows with missing values.
- Mode: Replace missing values with the mode (most frequent value).
- Random: Same as for numerical data.
- KNN: The most common class among the K-nearest neighbors replaces the missing value. Choosing the right method depends on the specific data, the importance of the feature, and the proportion of missing values.
There are more techniques for handling missing values, and they really matter in practice; depending on the problem, you might experiment with different approaches.
Next, we define the train_data and test_data that will be used for training and testing our model.
train_data = train_data[cols_num+cols_cat+['booking_status']]
test_data = test_data[cols_num+cols_cat]

train_data = train_data.drop('Booking_ID', axis=1)
test_data = test_data.drop('Booking_ID', axis=1)

train_data.info() ## outputs the information below

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32647 entries, 0 to 32646
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 no_of_adults 28231 non-null float64
1 no_of_children 5043 non-null float64
2 no_of_weekend_nights 7729 non-null float64
3 no_of_week_nights 24287 non-null float64
4 lead_time 6935 non-null float64
5 arrival_month 31740 non-null float64
6 avg_price_per_room 9058 non-null float64
7 type_of_meal_plan 16544 non-null object
8 room_type_reserved 11360 non-null object
9 market_segment_type 18526 non-null object
10 booking_status 32295 non-null float64
dtypes: float64(8), object(3)
memory usage: 2.7+ MB
Since there are not too many missing values in the booking_status column, we will remove all rows where booking_status is missing: that’s the variable we’re trying to predict, so it’s important to keep only rows where it is known.
We will check the missing data again and also plot the data before imputing the missing values. We plot it beforehand so that we can verify afterward whether the distribution of the data is still similar to the one before imputation (we can’t make huge changes when imputing values, since that would artificially modify the dataset and lose the real representation of the data, i.e., it might bias the predictions).
train_data.isnull().mean().sort_values()

booking_status          0.010782
arrival_month           0.027782
no_of_adults            0.135265
no_of_week_nights       0.256073
market_segment_type     0.432536
type_of_meal_plan       0.493246
room_type_reserved      0.652035
avg_price_per_room      0.722547
no_of_weekend_nights    0.763255
lead_time               0.787576
no_of_children          0.845529
dtype: float64

train_data.hist(bins=20, figsize=(20,15))
Missing Values Implementation and Usage
def handle_missing(table, columns=None, method='drop'):
    table = table.copy()
    if columns is None:
        columns = table.columns
    for col in columns:
        if method == 'drop':
            # drop the rows where this column is missing
            table = table.dropna(subset=[col])
        elif method == 'mode':
            table[col] = table[col].fillna(table[col].mode()[0])
        elif method == 'median':
            table[col] = table[col].fillna(table[col].median())
        elif method == 'mean':
            table[col] = table[col].fillna(table[col].mean())
        elif method == 'random':
            # replace each missing value with a random observed value from the same column
            table[col] = table[col].apply(
                lambda x: np.random.choice(table[col].dropna().values) if pd.isna(x) else x)
    return table

## Remove no_of_children column, too many missing values
train_data = train_data.drop('no_of_children', axis=1)
test_data = test_data.drop('no_of_children', axis=1)

from sklearn.impute import KNNImputer

# instantiate the imputer, let's say with k=3 (you may tune k as needed)
knn_imputer = KNNImputer(n_neighbors=3)

# select the columns to impute
cols_to_impute = ['avg_price_per_room']

# apply the imputer
train_data[cols_to_impute] = knn_imputer.fit_transform(train_data[cols_to_impute])
test_data[cols_to_impute] = knn_imputer.transform(test_data[cols_to_impute])

# Check the result
print(train_data[cols_to_impute].isnull().sum())
## avg_price_per_room    0
## dtype: int64

train_data = handle_missing(train_data, columns=['no_of_week_nights', 'no_of_weekend_nights', 'lead_time', 'arrival_month'], method='random')
test_data = handle_missing(test_data, columns=['no_of_week_nights', 'no_of_weekend_nights', 'lead_time', 'arrival_month'], method='random')

train_data = handle_missing(train_data, columns=['no_of_adults', 'type_of_meal_plan', 'market_segment_type', 'room_type_reserved'], method='mode')
test_data = handle_missing(test_data, columns=['no_of_adults', 'type_of_meal_plan', 'market_segment_type', 'room_type_reserved'], method='mode')

train_data.hist(bins=20, figsize=(20,15))
Defining our Input Train_Data Set
Checking again whether we successfully handled the missing values.
train_data.isnull().mean().sort_values()

no_of_adults            0.000000
no_of_weekend_nights    0.000000
no_of_week_nights       0.000000
lead_time               0.000000
arrival_month           0.000000
avg_price_per_room      0.000000
type_of_meal_plan       0.000000
room_type_reserved      0.000000
market_segment_type     0.000000
booking_status          0.010782
dtype: float64

train_data = train_data.dropna(subset=['booking_status'])

train_data.isnull().mean().sort_values()

no_of_adults            0.0
no_of_weekend_nights    0.0
no_of_week_nights       0.0
lead_time               0.0
arrival_month           0.0
avg_price_per_room      0.0
type_of_meal_plan       0.0
room_type_reserved      0.0
market_segment_type     0.0
booking_status          0.0
dtype: float64

train_data.describe()
Data Normalization
Scaling numerical features to a certain range (like 0 to 1 or -1 to 1) is a good practice in machine learning. It helps ensure that all features contribute equally to the model’s prediction by preventing any single feature from dominating due to its larger scale. Additionally, it makes optimization algorithms more effective, as they usually function better with smaller numbers.
For example:
Consider two data points:
Person A: Age = 25, Income = $50,000
Person B: Age = 50, Income = $100,000
Without scaling, the income feature would overpower the age feature due to its larger values, affecting our model’s learning.
By applying min-max scaling, we adjust the values:
Person A: Scaled Age = 0, Scaled Income = 0
Person B: Scaled Age = 1, Scaled Income = 1
Now, both features have the same range, allowing the model to learn from both without bias.
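To make the arithmetic concrete, here is a tiny sketch of the same min-max formula, x' = (x - min) / (max - min), applied to those two hypothetical people (the values are illustrative, not from our dataset):

import numpy as np

ages = np.array([25.0, 50.0])
incomes = np.array([50_000.0, 100_000.0])

# Min-max scaling maps the smallest value to 0 and the largest to 1
scaled_ages = (ages - ages.min()) / (ages.max() - ages.min())              # -> [0.0, 1.0]
scaled_incomes = (incomes - incomes.min()) / (incomes.max() - incomes.min())  # -> [0.0, 1.0]

print(scaled_ages, scaled_incomes)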
cols_num.remove('no_of_children')

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(train_data[cols_num])

train_data[cols_num] = scaler.transform(train_data[cols_num])
test_data[cols_num] = scaler.transform(test_data[cols_num])
Encoding Categorical Data
Before we can apply machine learning algorithms to categorical variables, we need to transform them into numerical form. This process is known as encoding.
For instance, take the ‘Transmission’ column, which contains ‘Manual’ and ‘Automatic’. Given that there are only two categories, we can use binary encoding: assign ‘0’ to ‘Manual’ and ‘1’ to ‘Automatic’, or vice versa.
In the case of ‘Fuel_Type’ with three categories, we can apply One-Hot-Encoding. Here, each category gets its own column in the data, and these new columns are binary, indicating the presence (1) or absence (0) of that category for a given record.
One-Hot-Encoding is particularly useful when the categories do not have a natural order or hierarchy, as is the case with ‘Fuel_Type’. It prevents the machine learning algorithm from assigning inappropriate weight or importance to the categories based on a numerical value.
However, one should be cautious when using One-Hot-Encoding with a variable that has many categories. This is because it can lead to a high increase in the number of columns (dimensionality) in your dataset, making it sparse and potentially harder to work with — a situation often referred to as the “Curse of Dimensionality”. In such situations, other encoding techniques such as ordinal encoding or target encoding might be more appropriate.
An example of One Hot Encoding:
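Since the original illustration is not reproduced here, the toy sketch below (using a made-up ‘Fuel_Type’ column, not our hotel data) shows what the encoding looks like with pandas:

import pandas as pd

# Hypothetical three-category column, as in the Fuel_Type example above
toy = pd.DataFrame({'Fuel_Type': ['Petrol', 'Diesel', 'CNG', 'Petrol']})

# One-Hot-Encoding: each category becomes its own column, and each row has a
# 1 (True) only in the column of its own category
print(pd.get_dummies(toy, columns=['Fuel_Type']))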
OneHotEncoder Implementation using sklearn library
cols_cat.remove('Booking_ID')

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoder.fit(train_data[cols_cat])

encoded_columns = list(encoder.get_feature_names_out(cols_cat))

train_data[encoded_columns] = encoder.transform(train_data[cols_cat])
test_data[encoded_columns] = encoder.transform(test_data[cols_cat])

train_data = train_data.drop(cols_cat, axis=1)
test_data = test_data.drop(cols_cat, axis=1)
Handling Class Imbalance
SMOTE stands for Synthetic Minority Over-sampling Technique. It’s a technique used to handle imbalanced datasets, which are quite common in real-world scenarios. Imbalanced data typically refers to a classification problem where the number of examples (rows) per class is not evenly distributed. Often, you’ll have a large amount of data/observations available for one class (referred to as the majority class) and less data for one or more other classes (referred to as the minority classes).
In such scenarios, many machine learning models tend to be overwhelmed by the majority class and ignore the minority class (resulting in a biased prediction). This is a problem because typically, the minority class is more interesting (and considered more important) than the majority class. For instance, in this hotel reservation scenario, the number of canceled reservations (minority) is often much smaller than the number of not canceled reservations.
Here’s how SMOTE works:
SMOTE creates synthetic observations of the minority class by:
- Choosing a minority class observation at random.
- Finding its k-nearest minority class neighbors (the number of neighbors is specified as a parameter to SMOTE).
- Choosing one of these neighbors and placing a synthetic point anywhere on the line joining the observation and its chosen neighbor.
By oversampling the minority class in this way, SMOTE helps to “level the playing field” and allows the model to learn more about the minority class characteristics.
from imblearn.over_sampling import SMOTE

X = train_data.drop('booking_status', axis=1)
y = train_data['booking_status']

smote = SMOTE(random_state=42)

# fit predictor and target variable
x_smote, y_smote = smote.fit_resample(X, y)
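As a quick sanity check (not in the original code), you can compare the class counts before and after resampling; after SMOTE both classes should contain the same number of rows:

# Class distribution before vs. after SMOTE
print(y.value_counts())        # imbalanced counts
print(y_smote.value_counts())  # both classes should now have equal counts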
Training Our Model using XGBoost
The XGBoost model starts by making a very simple guess to predict the output based on our input features. This is usually just the average of the target column we’re trying to predict.
After making this initial guess, the model calculates how far off (the error) each of these initial predictions was from the actual values. This difference is called the “residual”.
Next, the model creates a “decision tree”, which is a flowchart-like structure of questions about our input data (it’s important to understand how DTs make predictions). However, this tree isn’t designed to predict the target directly. Instead, it’s designed to predict the residuals which are the errors made by our initial simple guess.
The predictions made by this tree are then scaled down, or “dampened”, by a factor known as the “learning rate”.
These scaled-down tree predictions are then added to our initial predictions, improving them a bit.
(Each new tree tries to correct the mistakes (residuals) of the previous trees. If we just added these corrections directly (without scaling them down), we might overcorrect and make the model too complex, leading to overfitting. Overfitting is when the model learns the training data too well, including its noise and outliers, and performs poorly on new, unseen data. To avoid this, we multiply the corrections from each new tree by the learning rate before adding them to our predictions. This “dampens” the corrections, making the model learn more slowly and helping to prevent overfitting.)
The model then repeats this process (compute residuals, fit a tree, scale its predictions, add them in), each time creating a new decision tree to correct the errors of the previous ones.
So, as a summary, XGBoost is like a team of decision trees where each new member learns from the mistakes of the previous members, aiming to continually get better at predicting the target variable.
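To make that loop concrete, here is a deliberately simplified sketch of the boosting idea for a numeric target, using shallow sklearn trees in place of XGBoost’s internals; it is illustrative only, not how the library is actually implemented:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_boosting(X, y, n_trees=50, learning_rate=0.1):
    # Step 1: start from a very simple guess -- the mean of the target
    prediction = np.full(len(y), y.mean(), dtype=float)
    trees = []
    for _ in range(n_trees):
        # Step 2: residuals = how far off the current predictions are
        residuals = y - prediction
        # Step 3: fit a small tree to predict those residuals
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        # Steps 4-5: dampen the correction by the learning rate and add it in
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, prediction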
For more information / mathematical explanation refer to: https://xgboost.readthedocs.io/en/stable/
Or you can check out the Ensemble Methods chapter of the book Data Mining: The Textbook by Charu C. Aggarwal.
X_train
The table above is our final table that will be used for training the model: it has no missing values, and it has some engineered features based on the transformations we did earlier. We’re also using RandomizedSearchCV to find the best parameters for our model.
Hyperparameter tuning and prediction on the validation set.
GridSearchCV and RandomizedSearchCV are two methods that can be used from sklearn for hyperparameter tuning. They are used to find the optimal hyperparameters of a model which results in the most ‘accurate’ predictions.
- GridSearchCV: This method performs an exhaustive search over specified parameter values for an estimator. It trains the model for each combination of the hyperparameters and retains the best combination. For example, if you specify max_depth values as [1, 2, 3] and n_estimators as [50, 100, 200], then GridSearchCV will try all combinations of [(1, 50), (1, 100), (1, 200), (2, 50), (2, 100), (2, 200), (3, 50), (3, 100), (3, 200)] and return the set of parameters with the best performance metric. The downside is that it can be very time-consuming for larger datasets or/and for too many parameters specified.
- RandomizedSearchCV: This method performs a random search over hyperparameters, where each setting is sampled from a distribution over possible parameter values. Given enough time, RandomizedSearchCV will find parameters as good as, or better than, GridSearchCV, and it is usually much faster while leading to similar results. A comparison sketch follows below.
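For comparison, an exhaustive search over the toy grid mentioned in the GridSearchCV example above would look roughly like the sketch below; with the full parameter space we use further down it would be far too slow, which is why we prefer RandomizedSearchCV:

from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Toy grid from the example above: 3 x 3 = 9 combinations, each evaluated with 5-fold CV
toy_grid = {'max_depth': [1, 2, 3], 'n_estimators': [50, 100, 200]}
grid_search = GridSearchCV(xgb.XGBClassifier(), toy_grid, cv=5, scoring='accuracy')
# grid_search.fit(X_train, Y_train)   # X_train / Y_train are defined in the next cell
# print(grid_search.best_params_)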
from sklearn.model_selection import train_test_split

X_train, X_val, Y_train, Y_val = train_test_split(x_smote, y_smote, test_size=0.2, random_state=42)
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create a pipeline
pipe = Pipeline([
    ('clf', xgb.XGBClassifier(use_label_encoder=False))
])

params = {
    "clf__learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "clf__max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "clf__min_child_weight": [1, 3, 5, 7],
    "clf__gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "clf__colsample_bytree": [0.3, 0.4, 0.5, 0.7],
    "clf__subsample": [0.6, 0.7, 0.8, 0.9, 1.0],
    "clf__reg_alpha": [0, 0.001, 0.005, 0.01, 0.05],
    "clf__reg_lambda": [0.01, 0.1, 1.0, 10.0, 100.0]
}

# Create the RandomizedSearchCV object
cv = RandomizedSearchCV(pipe, params, cv=5, scoring='accuracy')

# Fit to the training set
cv.fit(X_train, Y_train)

# Predict the labels of the validation set
Y_pred = cv.predict(X_val)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_val, Y_val)))
print("Tuned Model Parameters: {}".format(cv.best_params_))
## Accuracy: 0.7448909299655568
## Tuned Model Parameters: {'clf__subsample': 0.7, 'clf__reg_lambda': 1.0, 'clf__reg_alpha': 0.001, 'clf__min_child_weight': 1, 'clf__max_depth': 15, 'clf__learning_rate': 0.2, 'clf__gamma': 0.4, 'clf__colsample_bytree': 0.4}
We can see that the accuracy of our model is about 0.74 on the validation data, so the model is now ready to be used on the test data. Hotels can use this model to predict whether a customer is going to hold onto their reservation or cancel it.
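If you want to actually generate those predictions, a final hedged sketch might look like this (it assumes test_data has gone through exactly the same preprocessing steps as the training features; the output file name is hypothetical):

# Predict cancellation status for the unseen test set using the tuned pipeline
test_predictions = cv.predict(test_data)

# Hypothetical export of the results, e.g. for the hotel to act on
import pandas as pd
pd.DataFrame({'predicted_booking_status': test_predictions}).to_csv('predictions.csv', index=False)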