I believe that multilevel models and generalized linear models are among the best tools for predicting continuous values, given their adaptability and explanatory power. However, like other machine learning methods, they pose a challenge when you need to decide which variables to use to achieve the best results.
A common strategy for refining multilevel models is the step-up approach, proposed by Raudenbush and Bryk in 2002. This technique is an iterative process in which variables are gradually added to the model based on statistical criteria, such as the p-value and a goodness-of-fit measure like the log-likelihood, Bayesian Information Criterion (BIC), or Akaike Information Criterion (AIC).
In the step-up strategy, an initial null model is created without predictor variables to assess the statistical significance of the random intercept effects. The model is then enriched with other variables, and this step is repeated until the complete model is reached. I covered this topic in an article I wrote for the conclusion of an MBA last year, and the PDF can be accessed through this link (in Portuguese).
Looking for a way to guarantee the best possible result for a model, I followed a slightly different approach and created a refinement method that searches for the best model exhaustively, rather than incrementally as in the step-up strategy.
This approach tests all possible combinations of level 1 and level 2 variables and ranks the resulting models by one of the following indicators:
- Log-likelihood: a measure widely used for model comparison in the academic field.
- BIC (Bayesian Information Criterion): a measure that builds upon the log-likelihood but applies a penalty based on the number of parameters in the model, so more complex models must improve the fit enough to justify their extra parameters.
- RMSE (Root Mean Squared Error): a measure that compares predicted values with actual values while preventing positive and negative deviations from canceling each other out. Since regression models are deterministic, always producing the same predictions for the same inputs, this indicator can be calculated during model evaluation using the training dataset itself for testing (see the sketch right after this list).
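As a reference, here is a minimal sketch of how these three indicators can be obtained from a fitted model with statsmodels, assuming a DataFrame df_sells and the column names used later in this article:

```python
import numpy as np
import statsmodels.formula.api as smf

# Fit a simple mixed model (reml=False so log-likelihoods are comparable
# between models with different fixed effects)
model = smf.mixedlm('receita ~ desemprego', data=df_sells,
                    groups=df_sells['cidade'], re_formula='~nps')
result = model.fit(reml=False)

llk = result.llf                        # log-likelihood
n = len(df_sells)
k = len(result.params)                  # rough parameter count; statsmodels'
                                        # own BIC may count slightly differently
bic = -2 * llk + k * np.log(n)          # BIC derived from the log-likelihood
rmse = np.sqrt(np.mean((df_sells['receita'] - result.fittedvalues) ** 2))
```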
Below, I present a practical example of this strategy.
This method was created in Python, using the “MixedLM” class from the “statsmodels” package, and it follows some naming conventions used by the package, along with others that complement its execution:
- y_variable: the name of the response variable, which will be generated by the model.
- formula_variables: a list of higher-level (level 2) variables.
- groups: the variable used as a grouping factor.
- re_formula_variables: a list of lower-level (level 1) variables.
- data: a DataFrame containing the data for model creation.
- detailed_verbose (default False): indicates whether all tested combinations will be displayed during execution (True) or only the final result (False).
- show_best_model_summary (default True): indicates whether the summary of the model will be displayed at the end of execution according to the “summary()” method.
- class_criterion (default ‘llk’): indicates the model’s classification criterion, which can be ‘llk’ for Log-likelihood, ‘bic’ for Bayesian Information Criterion, or ‘RMSE’ for Root Mean Squared Error.
- icc_test (default False): indicates whether the intraclass correlation coefficient (ICC) should be displayed. The ICC measures the proportion of the total variance that is attributable to the grouping at the different levels (a sketch of this calculation follows the list).
- hlm_model_check (default False): indicates whether the p-value of the model should be displayed. This information helps identify whether the null model is statistically significant at a 95% confidence level, making it possible to check the significance of the random term and determine whether traditional models such as Ordinary Least Squares (OLS) are suitable for the scenario (Tabachnick and Fidell, 2013).
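For reference, the ICC can be computed from a null model as the ratio between the between-group variance and the total variance. Below is a minimal sketch, assuming the df_sells DataFrame introduced next:

```python
import statsmodels.formula.api as smf

# Null model: intercept only, with a random intercept per city
null_result = smf.mixedlm('receita ~ 1', data=df_sells,
                          groups=df_sells['cidade']).fit()

group_var = null_result.cov_re.iloc[0, 0]  # variance between cities
residual_var = null_result.scale           # residual variance within cities
icc = group_var / (group_var + residual_var)
print(f'ICC: {icc:.2%}')
```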
Consider the following fictitious dataset, df_sells, with these variables:
- loja: Store identifier
- cidade: City identifier
- receita: Accumulated revenue in 12 months
- nps: Net Promoter Score in 12 months
- tempo_medio_contrato: Average contract duration of the sales team
- perc_fidelidade: Percentage of loyalty club customers who made a purchase in this store in 12 months
- inflacao: City’s inflation percentage in 12 months
- desemprego: City’s unemployment rate in 12 months
- pib: City’s Gross Domestic Product (GDP) in 12 months
It is possible to observe that the variables can be separated into two levels:
- Level 1 — Store: receita, nps, tempo_medio_contrato, and perc_fidelidade
- Level 2 — City: inflacao, desemprego, and pib
With this in mind, the execution of all possible combinations for evaluating random intercepts and slopes in a multilevel model would be as follows:
```python
hlm2_refined = hlm2_model_test(
    y_variable = 'receita',
    formula_variables = ['desemprego', 'inflacao', 'pib'],
    groups = 'cidade',
    re_formula_variables = ['nps', 'tempo_medio_contrato', 'perc_fidelidade'],
    data = df_sells,
    class_criterion = 'llk')
```
Please note that level 1 variables (store) are listed in the re_formula_variables parameter, while level 2 variables (city) are listed in the formula_variables parameter. In this case, we will have 49 combinations [(3 + 3 + 1) * (3 + 3 + 1)], since each side contributes all of its non-empty subsets of one, two, or three variables. Here are some examples, with a counting sketch after the list:
- Formula = receita ~ desemprego and re_formula = nps
- Formula = receita ~ desemprego and re_formula = nps + tempo_medio_contrato
- Formula = receita ~ desemprego + inflacao and re_formula = nps
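To make the counting concrete, here is a small sketch of how such combinations could be enumerated (the internals of hlm2_model_test may differ; this only illustrates the idea):

```python
from itertools import combinations

def all_subsets(variables):
    """Yield every non-empty combination of the given variables."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)

formula_vars = ['desemprego', 'inflacao', 'pib']
re_formula_vars = ['nps', 'tempo_medio_contrato', 'perc_fidelidade']

pairs = [(f, r) for f in all_subsets(formula_vars)
                for r in all_subsets(re_formula_vars)]
print(len(pairs))  # 49 = (3 + 3 + 1) * (3 + 3 + 1)
```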
Each combination will generate and evaluate a model, and in the end they will all be ranked according to the chosen criterion, in this case log-likelihood. After ranking, the results of the process will be displayed in two parts:
- Best model: the winning combination and the result of its classification criterion.
- Summary: the summary of the selected model, as produced by the MixedLM method.
To create a model that considers the influence of city-level variables on the stores, it is necessary to modify the method call, now considering these effects, as shown in the example below:
```python
hlm2_final = hlm2_model_test(
    y_variable = 'receita',
    formula_variables = ['desemprego', 'inflacao',
                         'nps', 'tempo_medio_contrato',
                         'nps:desemprego', 'nps:inflacao',
                         'tempo_medio_contrato:desemprego',
                         'tempo_medio_contrato:inflacao'],
    groups = 'cidade',
    re_formula_variables = ['nps', 'tempo_medio_contrato'],
    data = df_sells,
    class_criterion = 'llk')
```
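In the statsmodels formula syntax, a term such as nps:desemprego denotes the interaction between the two variables. One of the candidates evaluated by the search could therefore look like the following sketch, under the same assumptions as before:

```python
import statsmodels.formula.api as smf

# One candidate out of the 765: two fixed effects plus their interaction,
# with a random slope for nps within each city
candidate = smf.mixedlm('receita ~ nps + desemprego + nps:desemprego',
                        data=df_sells,
                        groups=df_sells['cidade'],
                        re_formula='~nps')
print(candidate.fit(reml=False).llf)  # the 'llk' ranking criterion
```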
In this case, there will be 765 model combinations to be evaluated [(8 + 28 + 56 + 70 + 56 + 28 + 8 + 1) * (2 + 1)], so the process will take a little longer, depending on the computer where it is executed. In the end, we will have the result, separated in the same way as in the previous example: the best model, followed by its summary.
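The bracketed count is just the number of non-empty subsets of the eight fixed-effect terms multiplied by the number of non-empty subsets of the two random-slope terms, which is quick to verify:

```python
from math import comb

fe_subsets = sum(comb(8, k) for k in range(1, 9))  # 8 + 28 + ... + 1 = 255
re_subsets = sum(comb(2, k) for k in range(1, 3))  # 2 + 1 = 3
print(fe_subsets * re_subsets)                     # 765
```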
Please note that this time some variables did not show statistical significance at a 95% confidence level. However, the log-likelihood improved compared to the previous model, and the combinations without these variables did not show better predictive capability, so those combinations were discarded.
Since this method returns a DataFrame with all the combinations tested, their indicators, and a copy of each model generated by the MixedLM method, it is possible to evaluate and compare the tests performed. To obtain the best model, simply execute:
```python
hlm2_final_model = hlm2_final.iloc[0]['model']
```
In this example, the variable hlm2_final_model will store the model generated by the MixedLM method. To visualize the top five evaluated models, we can use the head command, as shown in the example below:
```python
hlm2_final.head(5)
```
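Because each row keeps the fitted results object, the extracted model can be used directly, for example to print its summary or to generate predictions from the fixed effects (in statsmodels, MixedLM's predict() ignores the random effects):

```python
# Inspect the winning model and predict revenue from the fixed effects
print(hlm2_final_model.summary())
predicted = hlm2_final_model.predict(df_sells)
```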
The code used here and more information about the dataset can be found on my GitHub.
If you have found any errors or have any suggestions for improvement, please send me a message at marcosmhs@live.com, and I will be glad to further enhance this text and give you credit.