In this blog, I look at these concepts in Machine Learning:
1. Cloud feasibility study
2. Data analysis and opportunity identification
3. Data pre-processing
4. Model selection and training
5. Model evaluation and visualization
6. Model deployment
7. Conclusion
Introduction
In this Machine Learning project, everything is hosted in the cloud, so very little needs to be installed on our local machines: the computing power we need is already provided by the cloud platform, AWS. We still need to know which packages the project requires so that we can install them on the platform. Since we will be working in a Jupyter notebook, there is little to worry about, as most of the packages we need require only a pip installation or an import statement. With this introduction in place, let us get to know the project, which you can follow through in this blog.
Cloud Feasibility Study
Machine Learning in the cloud is a trending topic, as most models produced today need to be deployed on cloud servers. This helps ensure that models can run within their intended time frame. Cloud servers provide storage space and computing power that are not readily available on personal computers. Using cloud computing power to run a model saves time and spares your own computer from the strain.
This blog walks through the process, and it can also serve as documentation that one can use as a guide to how machine learning implementation in the cloud can be done.
AWS is a cloud computing service that provides on-demand computing resources for storage, networking, machine learning, etc. on a pay-as-you-go pricing model. AWS is a premier cloud computing platform around the globe, and many organizations use AWS for global networking and data storage. This documentation is for machine learning practitioners who know how to build models and may have deployed projects on other platforms, but want to learn how to deploy on a major cloud platform like AWS.
Our main aim in this documentation is to learn deployment on AWS, but we will walk through each step, from the development of the machine learning model to its deployment on AWS.
In this analysis, we are presented with a FIFA World Cup 2022 dataset. It contains several teams to analyse, and our goal is to predict the winning team. For this to be effective, we first need to explore the data. The steps we take are to download the data from Kaggle and load it into our notebook; this notebook is what we will deploy in the AWS cloud.
As in any analysis, the data is prepared for modelling by loading it into a notebook. After this, we check the data for missing values. The dataset we are presented with has no missing data, so we proceed to understand its features. Looking at the match dates helps us understand the data more clearly: it contains matches played up to 2018, which can help us make predictions for matches from 1992 onwards.
Since this is the data we will use, we inspect it first so that we can look at the types of data it contains.
Deploying Model in Cloud
NIST defines four cloud deployment models: public cloud, private cloud, community cloud, and hybrid cloud. A cloud deployment model is defined by where the deployment infrastructure resides and who controls that infrastructure. Deciding which deployment model to use is one of the most important cloud deployment decisions.
Each cloud deployment model meets the needs of different organizations, so it's important to choose a model that fits yours. Perhaps more importantly, each model has a different value proposition and different associated costs, so the choice often comes down to money. Either way, we need to be aware of the peculiarities of each environment to make an informed decision.
Data Analysis and Opportunity Identification
In the world of machine learning, data scientists (DS) currently fill one or both of two key roles:
- The DS receives a data dump, applies machine learning algorithms to it, and returns the results in the form of a presentation or report.
- The DS creates software that stakeholders can use to leverage machine learning models.
The workflow can be broken down into the following basic steps:
- Training a machine learning model on a local system.
- Wrapping the inference logic in a Flask application.
- Using Docker to containerize the Flask application.
- Hosting the Docker container on an AWS EC2 instance and consuming the web service.
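The containerization step can be sketched with a minimal Dockerfile. This is an illustrative assumption rather than the project's actual file; `app.py` and `requirements.txt` are placeholder names:

```dockerfile
# Minimal image for serving a Flask inference app
# (app.py and requirements.txt are assumed placeholder names)
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
```

Building and running would then be `docker build -t ml-app .` followed by `docker run -p 5000:5000 ml-app`.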
To be able to use AWS, here are the steps we need to follow:
Create EC2 Instance
EC2 stands for Elastic Compute Cloud, which provides scalable compute capacity in the AWS cloud. You can also use AWS Lambda or AWS Elastic Beanstalk to deploy ML models, but EC2 is the longest-established option: you can install software on it and easily deploy your application. It is essentially a server that you rent in the cloud. Open the EC2 dashboard in your management console through the search tab or the Services menu. The dashboard shows how many instances are running. Click on the Launch Instance tab to launch a new EC2 instance; by configuring the 6-7 steps below, you can launch your new instance.
If you already have an AWS account, sign in and choose the instance where you want to run the image, as indicated in the screenshot below.
Step 1) Choose an Amazon Machine Image (AMI)
The first step is to choose an operating system for the instance, where you have many options. We will choose an operating system that is free-tier eligible, so select the Ubuntu AMI marked as free-tier eligible. You can also try any supported version of Linux.
Step 2) Choose Instance Type
This step is crucial, but you do not need to select anything here, because the default option, t2.micro, which is supported on a free-tier account, is already selected.
Step 3) Create a Key Pair
To secure your instance, we set up a key pair. Click on Create new key pair, enter any name of your choice without spaces, and click Create key pair.
Step 4) Network Settings / Configure Security Group
Here we have various network settings, but we keep them all at their defaults and move forward. Your AWS EC2 instance is private and unreachable unless you define what kind of sources it accepts network traffic from, so we create a new default security group.
Step 5) Configure Storage
Free-tier eligible customers can get up to 30 GB of EBS storage, and a minimum of 8 GB is allotted to you. We keep the default of 8 GB.
Step 6) Review and Launch
Finally, you get a summary of all your configurations, so you can review it and launch the EC2 instance. After that, if you visit the Instances tab, you will see our EC2 instance successfully launched and running.
Download and Install PuTTY and WinSCP
WinSCP is used to upload your project files to the server.
PuTTY is a remote client: using the SSH key, you get access to your machine. You can open an EC2 command prompt in PuTTY and install the project dependencies and libraries. To download PuTTY, visit the official site and download the installer for your OS.
Upload the website
Open WinSCP and provide the hostname and the key pair we created above. If the key file is in PEM format, WinSCP will ask you to convert it to PPK format; click OK. Click Login, and once you are successfully logged in you will have two parallel panes: the first displays the files of your local system, and the second the files on your EC2 instance. Open the desired project folder and copy each file to the right-hand (server) pane.
Install Python and Libraries on Server (EC2 Instance)
We will need PuTTY for this, so open PuTTY using the same key pair. Then install each library on the EC2 server one by one.
First install pip, with which we can install all the Python libraries on the AWS EC2 instance. Once pip is installed, we can run the pip command to install all the required libraries specified in the requirements file.
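On an Ubuntu instance, the installation step might look roughly like this; the project path is an illustrative assumption:

```shell
# Update package lists and install pip for Python 3 (Ubuntu)
sudo apt-get update
sudo apt-get install -y python3-pip

# Install every library listed in the project's requirements file
# (the ~/project path is an assumed placeholder)
pip3 install -r ~/project/requirements.txt
```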
Data Pre-processing
With AWS SageMaker, we can create a notebook instance in which to get our model working, using the machine learning models we chose. From this instance we can build, train and deploy models in an AWS notebook through these steps:
- Create a SageMaker notebook instance
- Prepare the data
- Train the model to learn from the data
- Deploy the model
- Evaluate your ML model’s performance
In this documentation, we look at each of these steps.
By selecting the instance type you want to use and giving the notebook a name, you are ready to move to the next step. However, it is important to note the following settings when creating the AWS notebook in SageMaker. One can launch SageMaker from the Services menu.
After setting up the notebook, you will see the page shown here, confirming that you have successfully set up the AWS notebook; you are now ready to prepare your model, run the machine learning model and deploy it on AWS.
Now it is time to open JupyterLab so that we can run the notebook on AWS. This is done by clicking Open Jupyter, as indicated in the screenshot above.
We compare nine different modelling approaches for soccer match results and goal differences on all international matches from 2005-2017, the FIFA World Cups 2010-2014 and the FIFA EUROs 2012-2016. Within this comparison, while the "Win / Draw / Lose" predictions show little difference in performance, "Goal Difference" prediction clearly favours Random Forest and a squad-strength-based decision tree. We also apply these models to World Cup 2018, where Random Forest and Logistic Regression again reach about 33% accuracy for "Goal Difference" and about 57% for "Win / Draw / Lose". However, a simple decision tree based on bet odds and squad strength is also comparable.
Here is what we set out to do:
- Predict the winner of international matches; prediction results are "Win / Lose / Draw" or a "goal difference".
- Apply the model to predict the results of the FIFA World Cup 2018.
This is the method we use to prepare and run the ML model in this project. We use the Kaggle data that is provided for us, to be as accurate as possible and to make sure we have enough matches to train the model and test it appropriately. The matches in the data run from 1972 to the last one played in 2018.
After loading the data into the Jupyter notebook on SageMaker, we perform feature extraction. Feature selection: to determine which side is more likely to win a match, based on my knowledge I come up with four main groups of features, as follows:
- Head-to-head match history between the two teams. Some teams have opponents they hardly ever beat, no matter how strong they currently are; for example, the German team usually loses to, or fails to beat, the Italian team in 90-minute matches.
- The recent performance of each team (10 most recent matches), aka "form". A team in "good" form usually has a higher chance of winning its next matches.
- Bet ratios before matches. Bookmakers have already done a lot of analysis before matches to set the betting odds, so why not include them?
- Squad strength (from the FIFA video game). We would like real squad strength, but such data is not free and not always available, so we use the strength ratings from the FIFA video games, which are updated regularly to track real strength.
Feature List
The feature list reflects those four factors.
*difference: team1 — team2
*form: performance in 10 recent matches.
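As a sketch of how the "form" group of features might be computed, assuming each past match is stored as a (goals_for, goals_against) pair (the function and key names here are my own, not the project's actual code):

```python
def form_features(results, window=10):
    """Summarize a team's recent form from (goals_for, goals_against) pairs."""
    recent = results[-window:]  # keep only the `window` most recent matches
    return {
        "goal_for": sum(gf for gf, ga in recent),      # goals scored
        "goal_against": sum(ga for gf, ga in recent),  # goals conceded
        "num_wins": sum(1 for gf, ga in recent if gf > ga),
        "num_draws": sum(1 for gf, ga in recent if gf == ga),
    }
```

The *difference features would then subtract team 2's values from team 1's.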
A look at the data can tell us what the data is all about.
Exploratory Data Analysis
There are a few questions that help us understand the data better:
- Imbalance of data
- Correlation between variables
First, we draw a correlation matrix of the large dataset, which contains all matches from 2005-2018 with feature groups 1, 2 and 3.
In general, the features are not correlated. "odd_win_diff" is fairly negatively correlated with "form_diff_win" (-0.5), indicating that the form of the two teams reflects the bookmakers' belief about the winner. One more interesting point: as the difference in bet odds increases, we see more goals.
Second, we draw a correlation matrix of the small dataset, which contains all matches from the World Cups 2010, 2014 and 2018 and the EUROs 2012 and 2016.
The overall rating is just an average of the "attack", "defence" and "midfield" indices, so we see a high correlation between them. In addition, some of the new squad-strength features show high correlations, for example "FIFA Rank", "Overall rating" and "Difference in winning odds".
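Each cell of such a correlation matrix is a pairwise Pearson coefficient; as a minimal pure-Python illustration of what each cell measures:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length feature columns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5
```

In practice the whole matrix is usually computed at once, e.g. with `DataFrame.corr()` in pandas.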
How does head-to-head matchup history affect the current match?
You may think that when the head-to-head win difference is positive, the match result should be "Win" (Team 1 beats Team 2), and vice versa: when the head-to-head win difference is negative, the result should be "Lose" (Team 2 beats Team 1). In the data, a positive head-to-head win difference gives a 51.8% chance that the match ends in a "Win", and a negative head-to-head win difference gives a 55.5% chance of a "Lose".
Let's test our hypothesis with a two-sample t-test. Null hypothesis: there is no difference in "h2h win difference" between "Win" and "Lose". Alternative hypothesis: there is a difference in "h2h win difference" between "Win" and "Lose".
The T-test between a win and lose:
Ttest_indResult(statistic=24.30496036405259, pvalue=2.503882847793891e-126)
A very small p-value means we can reject the null hypothesis and accept the alternative hypothesis.
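The result objects above have the shape of `scipy.stats.ttest_ind` output. As a sketch of what the statistic itself measures, the pooled (equal-variance) two-sample t statistic can be computed in pure Python:

```python
from statistics import mean, variance

def t_statistic(a, b):
    """Pooled two-sample t statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    # pooled sample variance across both groups
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5
```

`scipy.stats.ttest_ind` additionally converts this statistic into a p-value using the t distribution with na + nb - 2 degrees of freedom.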
We can do the same procedure with win-draw and lose-draw
The T-test between a win and a draw:
Ttest_indResult(statistic=7.8385466293651023, pvalue=5.395456011352264e-15)
The T-test between lose and draw:
Ttest_indResult(statistic=-8.6759649601068887, pvalue=5.2722587025773183e-18)
Therefore, we can say that the history of head-to-head matches between the two teams contributes significantly to the result.
How do the 10 most recent performances affect the current match?
We consider differences in "Goal For" (how many goals they scored), "Goal Against" (how many goals they conceded), "number of winning matches" and "number of drawing matches", and perform the same procedure as for the previous question. From the pie charts, we can see a clear distinction in the "number of wins": the proportion of the "Win" result decreases from 49% to 25%, while the "Lose" result increases from 26.5% to 52.3%.
Pie charts are not enough; we should do hypothesis testing to see the significance of each feature.
| Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
| --- | --- | --- | --- |
| Goal For | p-value = 2.50e-126 | p-value = 5.39e-15 | p-value = 5.27e-18 |
| Goal Against | p-value = 0.60 | p-value = 0.17 | p-value = 0.08 |
| Number of Winning Matches | p-value = 3.02e-23 | p-value = 1.58e-33 | p-value = 2.57e-29 |
| Number of Draw Matches | p-value = 1.53e-06 | p-value = 0.21 | p-value = 0.03 |
We see much smaller p-values for "Goal For" and "Number of Winning Matches". Based on the t-tests, we know that differences in "Goal For" and "Number of Winning Matches" are helpful features.
Do stronger teams usually win?
We define the stronger team based on:
- Higher FIFA Rank
- Higher Overall Rating
| Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
| --- | --- | --- | --- |
| FIFA Rank | p-value = 2.11e-10 | p-value = 0.65 | p-value = 0.00068 |
| Overall Rating | p-value = 1.53e-16 | p-value = 0.0804 | p-value = 0.000696 |
Do young players play better than old ones?
Young players may have better stamina and more energy while older players have more experience. We want to see how age affects match results.
| Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
| --- | --- | --- | --- |
| Age | p-value = 2.07e-05 | p-value = 0.312 | p-value = 0.090 |
Based on the t-test and pie chart, we know that age contributes significantly to the result; more specifically, younger teams tend to play better than older ones.
Is a short pass better than a long pass? A higher value of "Build Up Play Passing" means "Long Pass", a lower value means "Short Pass", and a value in the middle means "Mixed-Type Pass".
| Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
| --- | --- | --- | --- |
| Build Up Play Passing | p-value = 1.05e-07 | p-value = 0.0062 | p-value = 0.571 |
Based on the t-test and pie chart, we know that passing style contributes significantly to the result; more specifically, teams that rely on a "Longer Pass" usually lose the game.
Model selection and training
How does crossing pass affect the match result?
How does chance creation shooting affect match results?
How does defence pressure affect match results?
How does defence aggression affect match results?
How does the defence team width affect the match results?
How do the labels distribute in reduced dimensions?
For this question, we use PCA to pick the first two principal components, which best explain the data, and then plot the data in the new dimensions.
While "Win" and "Lose" are fairly separate, "Draw" seems to be mixed in with the other labels.
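A minimal sketch of that projection step, using a plain SVD rather than the exact code used in the project:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the first n_components principal components."""
    Xc = X - X.mean(axis=0)  # center each feature column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # coordinates in the reduced space
```

scikit-learn's `sklearn.decomposition.PCA` does the same via `fit_transform`.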
Our main prediction objectives are "Win / Lose / Draw" and "Goal Difference". In this work, we run two main experiments, and for each experiment we follow this procedure:
- Split the data 70:30 into training and test sets.
- First, we normalize the features and convert categorical values to numbers.
- Second, we perform k-fold cross-validation to select the best parameters for each model based on some criteria.
- Third, we use the best model to predict with 10-fold cross-validation (9 folds for training, 1 fold for testing) and take the mean test error. This error estimate is more reliable.
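The k-fold splits in the procedure above can be sketched as a plain index generator (a hypothetical helper; in practice scikit-learn's `KFold` does this):

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    # distribute the remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```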
Experiment 1: build classifiers for "Win / Lose / Draw" from 2005 onwards. Because the "Bet Odds" feature is only available from 2005, we only conduct this experiment for that period.
Experiment 2: build classifiers for "Goal Difference" for the World Cup and UEFA EURO after 2010. The reason is that the "Squad Strength" features are not always available before 2010; some national teams do not have a squad-strength database in the FIFA video games. Since tackling the prediction with regression would be hard, we turn "Goal Difference" into a classification problem by defining labels as follows:
Team A vs Team B
“win_1”: A wins with 1 goal difference
“win_2”: A wins with 2 goal differences
“win_3”: A wins with 3 or more goal differences
“lose_1”: B wins with 1 goal difference
“lose_2”: B wins with 2 goal differences
“lose_3”: B wins with 3 or more goal differences
“draw_0”: Draw
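The labelling scheme above can be written as a small helper (an illustrative sketch; the function name is my own):

```python
def goal_diff_label(goals_a, goals_b):
    """Map a scoreline (Team A vs Team B) to the seven goal-difference classes."""
    diff = goals_a - goals_b
    if diff == 0:
        return "draw_0"
    prefix = "win" if diff > 0 else "lose"
    return f"{prefix}_{min(abs(diff), 3)}"  # cap at 3 for "3 or more"
```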
Experiment 3: in addition, we want to test how our trained models from Experiment 2 predict the "Goal Difference" and "Win/Draw/Lose" of matches in World Cup 2018.
Baseline models: in the EDA part, we already investigated the importance of the features and saw that odds, history, form and squad strength are all significant. Now we divide the features into three groups (odds, h2h-form and squad strength) and build "baseline models" on each group. To keep the baseline models simple, we set the decision tree hyper-parameters to maximum depth = 2 and maximum leaf nodes = 3.
To beat the baseline models, we use all the features and several machine learning algorithms, as follows:
Logistic Regression
Random Forest
Gradient Boosting Tree
ADA Boost Tree
Neural Network
LightGBM
Evaluation Criteria
Models are evaluated on the following criteria, which are computed for each of the labels "win", "lose" and "draw":
- Precision: among our predictions of a given class, what percentage did we get right? The higher the value, the better the prediction.
- Recall: among the actual instances of a given class, what percentage did we hit? The higher the value, the better the prediction.
- F1: a balance of precision and recall; the higher the value, the better the prediction. There are two averaged variants of F1:
- F1-micro: compute F1 by aggregating the true positives and false positives over all classes.
- F1-macro: compute F1 independently for each class and take the average (all classes weighted equally). In a multi-class classification setup, the micro-average is preferable if you suspect class imbalance (i.e. you may have many more examples of one class than of the others), so in this case we stick with F1-micro.
- 10-fold cross-validation accuracy: the mean accuracy over the cross-validation folds. This is a reliable estimate of the model's test error (no separate train/test split needed).
- Area under ROC: for binary classification, the true positive rate vs the false positive rate over all thresholds.
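The two F1 averages can be sketched in pure Python. Note that in a single-label multi-class setting, micro-F1 equals plain accuracy, because every misclassification is exactly one false positive (for the predicted class) and one false negative (for the true class):

```python
from collections import Counter

def f1_micro_macro(y_true, y_pred, labels):
    """Compute micro- and macro-averaged F1 for multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative
    per_class_f1 = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class_f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    micro = sum(tp.values()) / len(y_true)   # equals accuracy in this setting
    macro = sum(per_class_f1) / len(labels)  # every class weighted equally
    return micro, macro
```

scikit-learn provides the same via `f1_score(..., average="micro")` and `average="macro"`.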
After building and testing the model, it is straightforward to serve it from the cloud. Now that we have the model, one can either keep it on a local machine and deploy it to the cloud later, or deploy it to the cloud directly.
Results
Experiment 1 “Win / Draw / Lose”
| Model | 10-fold CV accuracy (%) | F1 micro average | AUROC micro average |
| --- | --- | --- | --- |
| Odd-based Decision Tree | 59.28 | 60.22 | 0.76 |
| H2H-Form-based Decision Tree | 51.22 | 51.52 | 0.66 |
| Logistic Regression | 59.37 | 59.87 | 0.76 |
| Random Forest | 54.40 | 55.92 | 0.74 |
| Gradient Boosting Tree | 58.60 | 59.47 | 0.77 |
| ADA Boost Tree | 59.08 | 60.22 | 0.77 |
| Neural Net | 58.96 | 58.36 | 0.77 |
| LightGBM | 59.49 | 60.28 | 0.78 |
Results from Experiment 1 show little improvement of the enhanced models over the baseline models on the three evaluation criteria: 10-fold cross-validation accuracy, F1 and area under the curve. A simple odd-based decision tree is enough to classify "Win / Draw / Lose". However, according to the confusion matrices for Experiment 1 in the appendix, most of the classifiers failed to classify the "Draw" label; only Random Forest and Gradient Boosting Tree could predict it at all, with 74 and 29 hits respectively. Furthermore, as mentioned, there is not much difference between the classifiers on the other criteria, so our recommendation for classifying "Win / Draw / Lose" is Gradient Boosting Tree and Random Forest.
Experiment 2 “Goal Difference”
| Model | 10-fold CV accuracy (%) | F1 micro average | AUROC micro average |
| --- | --- | --- | --- |
| Odd-based Decision Tree | 26.41 | 25.37 | 0.62 |
| H2H-Form-based Decision Tree | 16.74 | 18.94 | 0.59 |
| Squad-strength-based Decision Tree | 31.64 | 31.34 | 0.66 |
| Logistic Regression | 21.39 | 22.38 | 0.64 |
| Random Forest | 25.36 | 25.37 | 0.60 |
| Gradient Boosting Tree | 27.27 | 16.42 | 0.58 |
| ADA Boost Tree | 26.92 | 16.41 | 0.59 |
| Neural Net | 22.42 | 25.37 | 0.63 |
| LightGBM | 25.62 | 20.89 | 0.57 |
In experiment 2, the “Squad Strength” based Decision Tree tends to be superior to other classifiers.
Experiment 3 “Goal Difference” and “Win/Draw/Lose” in World Cup 2018
| Model | "Goal Difference" accuracy (%) | "Win/Draw/Lose" accuracy (%) | F1 micro average |
| --- | --- | --- | --- |
| Odd-based Decision Tree | 31.25 | 48.43 | 31.25 |
| H2H-Form-based Decision Tree | 25.00 | 34.37 | 25.00 |
| Squad-strength-based Decision Tree | 28.12 | 43.75 | 28.12 |
| Logistic Regression | 32.81 | 57.81 | 32.81 |
| Random Forest | 32.81 | 56.25 | 32.81 |
| Gradient Boosting Tree | 21.87 | 45.31 | 21.87 |
| ADA Boost Tree | 28.12 | 51.56 | 28.12 |
| Neural Net | 20.31 | 35.94 | 20.31 |
| LightGBM | 32.81 | 56.25 | 32.81 |
These are the results of our experiments with the models. This comparison helps us choose the right model to deploy on the server and use for prediction.
Here are the results:
Experiment 1
Odd-based Decision Tree:
h2h-Form-based Decision Tree:
Best parameters:
LogisticRegression(C=0.002154434690031882, class_weight=None, dual=False,
    fit_intercept=True, intercept_scaling=1, max_iter=100,
    multi_class='multinomial', n_jobs=1, penalty='l2',
    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
    warm_start=False)
Random Forest
Best parameters:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
    max_depth=None, max_features='auto', max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1,
    oob_score=False, random_state=85, verbose=0, warm_start=False)
Gradient Boosting tree
Best parameters:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
    learning_rate=0.1, loss='deviance', max_depth=3,
    max_features=None, max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=100,
    presort='auto', random_state=0, subsample=1.0, verbose=False,
    warm_start=False)
ADA boost tree
AdaBoostClassifier(algorithm='SAMME',
    base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
        max_features=None, max_leaf_nodes=None,
        min_impurity_decrease=0.0, min_impurity_split=None,
        min_samples_leaf=1, min_samples_split=2,
        min_weight_fraction_leaf=0.0, presort=False, random_state=None,
        splitter='best'),
    learning_rate=1, n_estimators=100, random_state=0)
Neural Net
Best parameters:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
    beta_2=0.999, early_stopping=False, epsilon=1e-08,
    hidden_layer_sizes=(10, 5), learning_rate='constant',
    learning_rate_init=0.1, max_iter=1000, momentum=0.9,
    nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
    solver='adam', tol=1e-10, validation_fraction=0.1, verbose=False,
    warm_start=False)
Light GBM
Best parameters:
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
    learning_rate=0.1, max_depth=-1, min_child_samples=20,
    min_child_weight=0.001, min_split_gain=0.0, n_estimators=20,
    n_jobs=-1, num_leaves=31, objective=None, random_state=1,
    reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
    subsample_for_bin=200000, subsample_freq=0)
Experiment 2
Odd-based Decision Tree:
h2h-Form-based Decision Tree:
squad-strength-based Decision Tree:
Logistic Regression
Best parameters:
LogisticRegression(C=2.1544346900318823e-05, class_weight=None, dual=False,
    fit_intercept=True, intercept_scaling=1, max_iter=100,
    multi_class='multinomial', n_jobs=1, penalty='l2',
    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
    warm_start=False)
Random Forest
Best parameters:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
    max_depth=None, max_features='auto', max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1,
    oob_score=False, random_state=85, verbose=0, warm_start=False)
Gradient Boosting tree
Best parameters:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
    learning_rate=0.1, loss='deviance', max_depth=3,
    max_features=None, max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=1000,
    presort='auto', random_state=0, subsample=1.0, verbose=False,
    warm_start=False)
ADA boost tree
AdaBoostClassifier(algorithm='SAMME',
    base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
        max_features=None, max_leaf_nodes=None,
        min_impurity_decrease=0.0, min_impurity_split=None,
        min_samples_leaf=1, min_samples_split=2,
        min_weight_fraction_leaf=0.0, presort=False, random_state=None,
        splitter='best'),
    learning_rate=1, n_estimators=100, random_state=0)
Neural Net
Best parameters:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
    beta_2=0.999, early_stopping=False, epsilon=1e-08,
    hidden_layer_sizes=(30, 15), learning_rate='constant',
    learning_rate_init=0.1, max_iter=1000, momentum=0.9,
    nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
    solver='adam', tol=1e-10, validation_fraction=0.1, verbose=False,
    warm_start=False)
Light GBM
Best parameters:
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
    learning_rate=0.1, max_depth=-1, min_child_samples=20,
    min_child_weight=0.001, min_split_gain=0.0, n_estimators=15,
    n_jobs=-1, num_leaves=31, objective=None, random_state=1,
    reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
    subsample_for_bin=200000, subsample_freq=0)
World Cup 2018 result
Now the model is applied to World Cup 2018 in Russia with a simulation count of 100,000.
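A hedged sketch of what one step of such a simulation might look like: each match outcome is sampled from the model's predicted class probabilities, and the tournament is replayed many times. The probabilities and function names here are illustrative assumptions, not the project's actual code:

```python
import random

def sample_outcome(p_win, p_draw, rng):
    """Sample one match outcome from predicted class probabilities."""
    r = rng.random()
    if r < p_win:
        return "win"
    if r < p_win + p_draw:
        return "draw"
    return "lose"

def win_rate(p_win, p_draw, n_sims=100_000, seed=42):
    """Fraction of simulated matches that end in a win for team 1."""
    rng = random.Random(seed)  # seeded for reproducibility
    wins = sum(sample_outcome(p_win, p_draw, rng) == "win" for _ in range(n_sims))
    return wins / n_sims
```

With 100,000 simulations, the sampled win rate converges closely to the model's predicted probability.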
Result Explanation:
Team A vs Team B (only valid until the 90th minute)
“win_1”: A wins with 1 goal difference
“win_2”: A wins with 2 goal differences
“win_3”: A wins with 3 or more goal differences
“lose_1”: B wins with 1 goal difference
“lose_2”: B wins with 2 goal differences
“lose_3”: B wins with 3 or more goal differences
“draw_0”: Draw
Final and Third Place
Semi-Finals
Quarter Finals
Round of 16
Match Day 3
Match Day 2
Match Day 1
Conclusion
In conclusion, the bookmakers' odds-based features are accurate in predicting match winners. Nevertheless, they do a terrible job of determining whether games end in a draw; in that case, ensemble methods like Gradient Boosting Tree and Random Forest are preferable. The FIFA video game squad index offers additional detail and contributes substantially to "Goal Difference" prediction. The team that wins according to our prediction is Belgium, in a final played between Belgium and England. Given the small amount of data and the ease with which a simple decision tree can offer a solution, more advanced machine learning models do not differ much from simple odd-based or strength-based trees.