![](https://crypto4nerd.com/wp-content/uploads/2023/06/11c9wcJgQVcokevTI-uPx8w.png)
Let’s first review our problem statement!
The travel sector accounts for a growing share of the global economy, and it suffered heavily from the economic impact of the COVID-19 pandemic. That impact created a need for more flexible and personalized travel products, like travel packages.
In the last article, we explored our dataset and discovered that:
- International destinations have slightly more cancellations than national ones;
- Proportionally, cases with a high number of status changes have a higher percentage of cancellations;
- The number of status changes is the most important feature of our dataset.
With these insights from our EDA, we’ll now apply models to estimate the likelihood of cancellations of those travel packages.
To evaluate the performance of our survival analysis models, we use the Concordance Index (C-index), a popular metric in survival analysis that measures the model’s ability to rank survival times correctly. This lets us quantitatively compare different models and identify the most effective approach for predicting cancellations.
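To make the metric concrete, here is a minimal pure-Python sketch of a Harrell-style C-index. It is not the implementation used in this project: the function name, the toy data, and the simplification of skipping pairs with tied times are all assumptions made for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C-index for right-censored data.

    times:       observed time per subject (event time or censoring time)
    events:      True if the event (cancellation) was observed, False if censored
    risk_scores: model output, where a higher score means the event is expected sooner
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the subject with the shorter observed time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # Assumption: pairs with tied times are skipped. A pair is comparable
        # only when the shorter time corresponds to an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # the model ranked the pair correctly
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four orders, the third one is censored.
toy_times = [5, 10, 15, 20]
toy_events = [True, True, False, True]
toy_risk = [0.9, 0.6, 0.2, 0.4]
print(concordance_index(toy_times, toy_events, toy_risk))
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 is what constant or random scores would give, and 0.0 means every comparable pair is ranked in the wrong order.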
Methodology
First of all, we needed to preprocess the data. The boxplot below shows the time to cancel, in days, for canceled cases. We treated every case above the boxplot’s upper limit as an outlier and removed it. We also removed cases with a time to cancel smaller than zero, because we considered them erroneous data.
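The filtering step can be sketched as follows. This assumes the boxplot’s upper limit is the usual upper whisker bound, Q3 + 1.5 × IQR, and that negative values are dropped before the quartiles are computed; neither detail is stated above, and the function name is hypothetical.

```python
from statistics import quantiles


def remove_time_outliers(days):
    """Drop negative values and values above the boxplot upper limit.

    Returns the kept values and the upper limit that was applied.
    """
    # Negative times to cancel are treated as erroneous data.
    non_negative = [d for d in days if d >= 0]
    # "inclusive" uses linear interpolation between data points.
    q1, _, q3 = quantiles(non_negative, n=4, method="inclusive")
    upper_limit = q3 + 1.5 * (q3 - q1)   # assumed whisker rule: Q3 + 1.5 * IQR
    kept = [d for d in non_negative if d <= upper_limit]
    return kept, upper_limit


# Toy values (hypothetical) for time_to_cancel_days of canceled orders.
kept, upper_limit = remove_time_outliers([-3, 10, 20, 30, 40, 50, 400])
print(kept, upper_limit)
```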
Also, from the EDA, we selected the following features based on the importance of each one of them to the cancellation event:
- ‘qty_status_changes’
- ‘operation_started’
- ‘accommodation_type’
- ‘destination_city’
- ‘destination_country’
- ‘qty_dailies’
- ‘last_fill_to_cancel_days’
- ‘time_to_cancel_days’ (time to the event)
- ‘order_canceled’ (event)
Previously, we determined that qty_status_changes is by far the most important feature of our dataset. To also confirm the importance of the operation_started feature, we applied a Log Rank Test.
The Log Rank Test is a statistical test that compares two survival curves and checks whether the groups they represent behave similarly with respect to the survival process. In our case, we used it to test the null hypothesis that the ‘operation started’ and ‘operation not started’ survival curves are statistically equivalent:
- Null hypothesis: the curves are statistically equivalent (not rejected if p_value > 0.05);
- Alternative hypothesis: the curves are not statistically equivalent.
The resulting p_value < 0.005 is below the 0.05 threshold, so we can reject the null hypothesis: the survival functions are not statistically equivalent. In other words, the groups ‘operation started’ and ‘operation not started’ differ in their survival process.
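For readers who want to see what the test computes, below is a minimal pure-Python sketch of a two-group log-rank test. It is an illustration rather than the code used for this analysis: the function name and toy data are assumptions, and in practice a library routine would normally be used instead.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value)."""
    # Distinct times at which at least one event was observed in either group.
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for x in times_a if x >= t)
        at_risk_b = sum(1 for x in times_b if x >= t)
        events_at_t_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        events_at_t_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n = at_risk_a + at_risk_b
        d = events_at_t_a + events_at_t_b
        # Expected events in group A if both groups shared one survival curve.
        observed_minus_expected += events_at_t_a - d * at_risk_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count at time t.
            variance += d * (at_risk_a / n) * (at_risk_b / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("variance is zero; the test is undefined for this data")
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data (hypothetical): one group cancels early, the other much later.
early = [1, 2, 3, 4, 5, 6, 7, 8]
late = [20, 21, 22, 23, 24, 25, 26, 27]
chi2, p_value = logrank_test(early, [True] * 8, late, [True] * 8)
print(chi2, p_value)
```

A small p-value, as in this toy example, says the observed event counts in one group deviate from what a shared survival curve would predict by more than chance alone would explain.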
Survival times
We start by checking the median survival time with the Kaplan-Meier curve. The plot below shows the survival rate over the time_to_cancel_days variable, our time to the event.
From the curve, it’s possible to see that at least 60% of the orders survived past one year (365 days). The median survival time is 436 days, almost 1.2 years.
Next, we fitted and plotted KM curves split by the feature ‘operation_started’. The curve for the cases where the operation started clearly has a longer median survival time than the other one. At 400 days, less than 40% of the cases in which the operation had not started survived. In contrast, almost 80% of the cases in which the operation had started survived past 400 days.
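The Kaplan-Meier estimate and the median survival time discussed in this section can be sketched in a few lines of pure Python. The function names and toy data are assumptions for illustration; this is not the code that produced the curves above. To compare groups such as ‘operation_started’, the same function would simply be applied to each group separately.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate. Returns [(t, S(t))] at each distinct event time."""
    curve = []
    survival = 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for x in times if x >= t)
        # Events (cancellations) observed exactly at time t.
        events_at_t = sum(1 for x, e in zip(times, events) if e and x == t)
        survival *= 1 - events_at_t / at_risk
        curve.append((t, survival))
    return curve


def median_survival_time(curve):
    """First time at which S(t) falls to 0.5 or below, or None if it never does."""
    for t, survival in curve:
        if survival <= 0.5:
            return t
    return None


# Toy data (hypothetical): six orders, two of them censored.
toy_times = [2, 3, 3, 5, 8, 9]
toy_events = [True, True, False, True, False, True]
toy_curve = kaplan_meier(toy_times, toy_events)
print(toy_curve)
print(median_survival_time(toy_curve))
```

Censored orders never trigger a drop in the curve, but they still count in the at-risk set until their censoring time, which is what distinguishes this estimate from a plain cancellation rate.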
Modeling
We chose two non-parametric, tree-based ensemble models to evaluate: the Random Survival Forest and the Gradient Boosting Survival Analysis.
The Random Survival Forest is an extension of the random forest algorithm to survival analysis. Like other members of the random forest family, it can handle both categorical and continuous features without extensive preprocessing, and it is robust to outliers and missing data.
The Gradient Boosting Survival Analysis is an application of the gradient boosting algorithm to survival analysis. It can capture non-linear relationships between variables, and it’s known for its high accuracy.
To get the best performance out of our models, we employed hyperparameter tuning through GridSearchCV considering the following parameters:
Results
To evaluate the models we used the Concordance Index, a metric that measures the model’s ability to rank subjects according to their survival times. We achieved a good level of C-index in both cases, but the Gradient Boosting Survival Analysis outperformed the Random Survival Forest.
This may be due to gradient boosting’s flexibility in capturing complex non-linear relationships between features. The data may contain intricate patterns that the random forest algorithm couldn’t identify.
Conclusion
In conclusion, our solution employing survival analysis techniques showcased promising results for predicting travel package cancellations.
We explored the data, understood which features are most important in the cancellation process, verified median survival time, and established a model to predict future outcomes.
As a next step, we can explore more advanced techniques that could improve the accuracy and robustness of the predictions.
The findings of this project can enable proactive interventions and adaptive strategies, reducing cancellations and maximizing customer satisfaction.