Better than the best predictions for the World Cup 2022 (pt. 5)
How to use calibration to fit our target group better
So far we have been using international football matches regardless of the type of game or tournament. Now we wonder: can we adjust our predictions to be better suited to our target group, World Cup games?
Yes, and we will do just that by calibrating our probability density functions to have a closer fit to World Cup games.
A short recap of the steps before:
- Calculate probability density functions (pdf), blog #3
- Clean data and calculate expected values, blog #4
Calibration
With this step, we adjust the probability density functions (pdf) to more closely represent our target group: World Cup games.
Here is an example of a match from the World Cup 2018:
2018, July 10th, France vs. Belgium
France rating: 1874
Belgium rating: 1854
Rating difference: 20
Match outcome: 1–0
Note that the FIFA ratings above are the normalized ratings we created in blog #3.
For this match the pdfs give the following values:
The outcome was 1–0; our functions tell us that result had a 10.6% chance of happening.
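To make the rest of this post concrete, here is a minimal sketch of how such a lookup could work, assuming each scoreline has its own fitted curve from rating difference to probability (the numbers and names below are illustrative, not the actual curves from blog #3):

```python
import numpy as np

# Illustrative layout: one curve per scoreline, mapping a rating
# difference to the probability of that scoreline.
rating_diffs = np.array([-400, -200, 0, 200, 400])
probs_1_0 = np.array([0.06, 0.09, 0.105, 0.11, 0.10])  # made-up points

pdfs = {"1-0": lambda d: np.interp(d, rating_diffs, probs_1_0)}

# France vs. Belgium, rating difference of 20:
print(pdfs["1-0"](20))  # about 0.105 with these made-up points
```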
How do we know if that is accurate?
We cannot know from a single match: we know that it happened, but it could have had a 1% chance of happening just as easily as a 50% chance.
What we can do is look at a bunch more, and compare the aggregate.
For example, we take 100 matches with this rating difference, look at all the outcomes and compare them with the predicted chance. When we predict 10.6% for 100 matches, we would expect around 11 matches to end up 1–0.
If this number is much higher or much lower, we should adjust our probabilities.
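As a sketch of that comparison, assuming the matches live in a DataFrame with a rating difference and an outcome column (the file and column names are made up):

```python
import pandas as pd

# Hypothetical dataset with one row per match; the column names
# 'rating_diff' and 'outcome' are assumptions for this sketch.
matches = pd.read_csv("matches.csv")

# Matches with roughly the same rating difference as France vs. Belgium.
similar = matches[matches["rating_diff"].between(0, 40)]

predicted = 0.106 * len(similar)              # expected number of 1-0 results
actual = (similar["outcome"] == "1-0").sum()  # observed number of 1-0 results
print(f"predicted {predicted:.1f}, actual {actual}")
```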
Small example
Let’s try that with five matches so we can see exactly what happens.
We randomly select five matches from World Cup games after 1992:
For each of these matches, we look up the probabilities:
We can then create a table with all the chances and the actual occurrences:
In the table above, we see the chances for each outcome for each rating difference. We are still using a very small sample, but we can already see the desired effect appearing: when we add more matches together, we get some idea of how the actual outcomes compare with the predicted chances.
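A sketch of how such a table could be built, reusing the hypothetical `pdfs` and `matches` from the sketches above (here assuming `pdfs` has an entry for each of the six scorelines):

```python
import pandas as pd

outcomes = ["0-0", "1-0", "0-1", "1-1", "2-0", "0-2"]

def comparison_table(sample: pd.DataFrame) -> pd.DataFrame:
    """Predicted vs. actual counts per scoreline for a sample of matches."""
    rows = []
    for outcome in outcomes:
        # Sum the predicted probability of this scoreline over all matches...
        predicted = sum(pdfs[outcome](d) for d in sample["rating_diff"])
        # ...and count how often the scoreline actually occurred.
        actual = (sample["outcome"] == outcome).sum()
        rows.append({"outcome": outcome, "predicted": predicted, "actual": actual})
    return pd.DataFrame(rows)

print(comparison_table(matches.sample(5)))
```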
A bigger sample
Let’s take a hundred matches and create the same table:
For the sample above, we see that 1–0, 0–1, and 2–0 happened more than expected. The other values 0–0, 1–1, and 0–2 happened less than expected.
More samples
One hundred is still not a lot, but we cannot increase the sample size much more, since we only have 866 matches in total to sample from. Increasing the sample size would mean it is no longer really a sample so much as the entire dataset.
Instead, we will create many more samples and check out the differences for each sample. Let’s create a hundred samples and look at the values for 1–0.
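The resampling step could look something like this, again building on the hypothetical helpers above:

```python
def difference_for(outcome: str, sample: pd.DataFrame) -> float:
    """Actual minus predicted count of one scoreline within a sample."""
    predicted = sum(pdfs[outcome](d) for d in sample["rating_diff"])
    actual = (sample["outcome"] == outcome).sum()
    return actual - predicted

# A hundred samples of a hundred World Cup matches each.
diffs = [difference_for("1-0", matches.sample(100)) for _ in range(100)]
```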
The correlation is not zero, but it is pretty close. It seems that when we sample from all matches, there is too much randomness in the sampling process for the pdf to counter.
However, it is not really the correlation we are looking for. We want to know whether the differences between predictions and actuals are centered around zero. In other words, we want to know if we predict too many about as often as we predict too few. The graph above does not show that clearly, so instead we create a bar chart of how often each difference occurs:
Much easier to read!
It looks centered around zero, but not entirely. The mean of 0.865 also shows that on average, we predict too few games ending in 1–0.
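One way to produce such a bar chart, using matplotlib and the `diffs` list from the previous sketch:

```python
import matplotlib.pyplot as plt
import numpy as np

rounded = np.round(diffs).astype(int)                  # bucket each difference
values, counts = np.unique(rounded, return_counts=True)

plt.bar(values, counts)
plt.xlabel("actual minus predicted 1-0 results (per sample of 100)")
plt.ylabel("number of samples")
plt.title(f"mean difference: {np.mean(diffs):.3f}")
plt.show()
```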
Scale
We can now increase the number of samples and repeat for each possible outcome.
We increase the scale to reduce the impact of one unlikely sample. Here is the same chart again but with a thousand samples.
Now that is a pretty distribution, except for that bar at -1, but we will allow it.
Let’s have a look at some other ones:
A mean below zero means the outcome happened less often than we predicted, and a mean above zero means it happened more often than we predicted.
Calibrate
We can now use these values to calibrate our pdfs. Note that a shift by a single value will only move the pdfs up or down; it will not change the shape of the line.
Each sample contains 100 matches, so a mean of +1.1 means the pdf should be 1.1% higher. In other words, we move the line up by 0.011.
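In code, the full shift for one line could look like this (a sketch on top of the hypothetical `pdfs` above):

```python
shift = 1.1 / 100  # mean difference of +1.1, samples of 100 matches each

def shifted(original_pdf, shift):
    """Move the whole line up (or down) by a fixed amount; the shape stays the same."""
    return lambda rating_diff: original_pdf(rating_diff) + shift

pdfs_calibrated = {"1-0": shifted(pdfs["1-0"], shift)}
```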
We can now repeat the process for all other lines; below are the most common six.
The above graph shows the original lines in grey and the calibrated lines in color. The lines for 1–0 and 0–1 were already dominating the graph and have only slightly increased their lead, so for our predictions, not much will change.
In this example, we chose maximum calibration: we adjusted for the entire difference. Note that we can choose any number in between as well: if we shift less, we stay closer to the original distribution; if we shift more, we move closer to the World Cup distribution.
In most situations, you will find the optimum is somewhere in the middle.
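A partial shift is then just a matter of scaling the correction, for example with a `strength` factor between 0 and 1 (the name is made up):

```python
def calibrate(original_pdf, shift, strength=0.5):
    """strength=0 keeps the original pdf, strength=1 applies the full shift."""
    return lambda rating_diff: original_pdf(rating_diff) + strength * shift

# Halfway between the original distribution and the World Cup distribution.
pdfs_calibrated = {"1-0": calibrate(pdfs["1-0"], 0.011, strength=0.5)}
```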
Wrap up
In this blog, we calibrated our probability density functions by comparing them with a target group. We sampled from World Cup games after 1992 to see if these samples aligned with the probabilities. We found that most lines are already pretty close, but can be shifted a bit to get closer to our target group.
New predictions
As always, we will generate a new set of predictions.
We will use the same expected value calculations we used in blog #4. The only difference is the underlying pdfs, which are now calibrated to World Cup games.
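As a sketch of that step: the scoring rule below is illustrative, not the actual rules from blog #4, and the helper names are made up; the only real change compared to blog #4 is passing in the calibrated pdfs.

```python
def points(prediction: str, outcome: str) -> int:
    """Illustrative scoring rule: points only for the exact score."""
    return 3 if prediction == outcome else 0

def expected_value(prediction: str, rating_diff: float, pdfs: dict) -> float:
    """Expected points for one prediction, given per-scoreline probabilities."""
    return sum(points(prediction, o) * pdfs[o](rating_diff) for o in pdfs)

def best_prediction(rating_diff: float, pdfs: dict) -> str:
    """Pick the scoreline with the highest expected value."""
    return max(pdfs, key=lambda o: expected_value(o, rating_diff, pdfs))

# Swapping in the calibrated pdfs (one entry per scoreline) is the only change.
print(best_prediction(20, pdfs_calibrated))
```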
Pred. EV: Predictions generated with pdf and expected value calculations.
Pred. cal. EV: Predictions generated with the calibrated pdf and expected value calculations.
For only one match was the difference enough to change the prediction; let’s hope it actually turns out to be 1–0.
The sum of our expected values increased from 206 to 207, so we should expect to gain one more point.