Better than the best predictions for the World Cup 2022 (pt. 5)

Sijmen van der Willik
6 min read · Nov 25, 2022


How to use calibration to fit our target group better

So far we have been using international football matches regardless of the type of game or tournament. Now we wonder, can we adjust our predictions to be more suitable to our target group: World Cup games?

Yes, and we will do just that by calibrating our probability density functions to have a closer fit to World Cup games.

A short recap of the steps before:

  • Calculate probability density functions (pdf), blog #3
  • Clean data and calculate expected values, blog #4

World Cup probability density function calibrator, happy bright digital art. [DALL·E]

Calibration

With this step, we adjust the probability density functions (pdf) to more closely represent our target group: World Cup games.

Here is an example of a match from the World Cup 2018:

2018, July 10th, France vs. Belgium
France rating: 1874
Belgium rating: 1854
Rating difference: 20
Match outcome: 1–0

Note that the FIFA ratings above are the normalized ratings we created in blog #3.

Occurrences of six different match results for each rating difference as a density function after data cleaning. The red dotted line shows the rating difference between France and Belgium on July 10th, 2018. [image by author]

For this match the pdfs give the following values:

The outcome was 1–0; our functions tell us that this result had a 10.6% chance of happening.
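As a rough illustration, this is how such a lookup could be done in Python. The grid, the `pdfs` dictionary, and the dummy curve below are placeholders for the real pdfs from blog #3, not the actual code used in this series:

```python
import numpy as np

# A minimal sketch, not the author's actual code: assume each outcome's pdf is
# stored as probability values on a grid of rating differences.
rating_diffs = np.linspace(-400, 400, 801)
pdfs = {
    # dummy curve for illustration; the real pdf comes from the smoothing in blog #3
    "1-0": np.interp(rating_diffs, [-400, 0, 400], [0.08, 0.11, 0.14]),
}

def outcome_probability(outcome: str, rating_diff: float) -> float:
    """Interpolate the pdf of `outcome` at the given rating difference."""
    return float(np.interp(rating_diff, rating_diffs, pdfs[outcome]))

# France vs. Belgium had a normalized rating difference of 20
print(outcome_probability("1-0", 20))
```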

How do we know if that is accurate?

We cannot know from a single match: we know that it happened, but it could just as well have had a 1% chance as a 50% chance of happening.

What we can do is look at a bunch more matches and compare the aggregate.

For example, we take 100 matches with this rating difference, look at all the outcomes, and compare them with the predicted chance. When we predict 10.6% for each of 100 matches, we would expect around 11 of them to end up 1–0.

If this number is much higher or much lower, we should adjust our probabilities.

Small example

Let’s try that with five matches so we can see exactly what happens.

We randomly select five matches from World Cup games after 1992:

For each of these matches, we look up the probabilities:

Occurrences of six different match results for each rating difference as a density function after data cleaning. The red dotted lines show the rating differences from the table above. [image by author]

We can then create a table with all the chances and the actual occurrences:

Total pdf indicates the sum of the columns (i.e. the five matches) for that outcome. Occ. indicates the actual number of occurrences of that outcome.

In the table above, we see the chances for each outcome for each rating difference. We are still using a very small sample, but we can already see the desired effect appearing: as we add more matches together, we get a picture of how the predicted chances compare with the actual outcomes.
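Here is a sketch of how such a table could be built, reusing the `outcome_probability` helper from the sketch above. The `wc_matches` DataFrame with `rating_diff` and `outcome` columns is assumed for illustration; it is not the actual data structure used in this series:

```python
import pandas as pd

OUTCOMES = ["0-0", "1-0", "0-1", "1-1", "2-0", "0-2"]

def comparison_table(matches: pd.DataFrame) -> pd.DataFrame:
    """Compare the summed pdf values (Total pdf) with the actual number of
    occurrences (Occ.) for each outcome in a sample of matches."""
    rows = {}
    for outcome in OUTCOMES:
        total_pdf = sum(outcome_probability(outcome, rd) for rd in matches["rating_diff"])
        occ = int((matches["outcome"] == outcome).sum())
        rows[outcome] = {"Total pdf": total_pdf, "Occ.": occ}
    return pd.DataFrame(rows).T

# Five random World Cup matches after 1992 (wc_matches is an assumed DataFrame)
sample = wc_matches.sample(n=5, random_state=0)
print(comparison_table(sample))
```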

A bigger sample

Let’s take a hundred matches and create the same table:

The total expected number of occurrences and the actual number of occurrences for each outcome for a sample of one hundred matches.

For the sample above, we see that 1–0, 0–1, and 2–0 happened more often than expected, while the other outcomes, 0–0, 1–1, and 0–2, happened less often than expected.

More samples

One hundred is still not a lot, but we cannot increase the sample size much more, since we only have 866 matches in total to sample from. If we increased the sample size much further, it would no longer really be a sample so much as the entire dataset.

Instead, we will create many more samples and check out the differences for each sample. Let’s create a hundred samples and look at the values for 1–0.

Expected occurrence vs. actual occurrence. Each blue dot shows one sample of one hundred World Cup matches. The red line (y=x) shows what a perfect prediction would look like; the orange line shows the best linear fit to the data points.

The correlation is not zero, but it is pretty close. It seems that when we sample from all matches, there is too much randomness in the sampling process for the pdf to counter.

However, it is not really the correlation we are looking for. We want to know whether the differences between predictions and actuals are centered around zero; in other words, whether we predict too many about as often as we predict too few. The graph above does not show that clearly, so instead we create a bar chart of the differences:

Bar chart showing the occurrence of differences between actual and predicted across one hundred samples of one hundred World Cup matches each.

Much easier to read!

It looks centered around zero, but not quite. The mean of 0.865 also shows that, on average, we predict too few games ending in 1–0.
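The sampling loop behind a chart like this could look roughly as follows, again using the assumed `wc_matches` DataFrame and the `outcome_probability` helper from the earlier sketches:

```python
import matplotlib.pyplot as plt

def difference_for_outcome(matches: pd.DataFrame, outcome: str) -> float:
    """Actual minus predicted number of occurrences of `outcome` in one sample."""
    predicted = sum(outcome_probability(outcome, rd) for rd in matches["rating_diff"])
    actual = (matches["outcome"] == outcome).sum()
    return actual - predicted

# One hundred samples of one hundred World Cup matches each
diffs = [
    difference_for_outcome(wc_matches.sample(n=100, random_state=i), "1-0")
    for i in range(100)
]

print(f"mean difference: {sum(diffs) / len(diffs):.3f}")
plt.hist(diffs, bins=range(-15, 16))
plt.xlabel("actual minus predicted occurrences of 1-0")
plt.ylabel("number of samples")
plt.show()
```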

Scale

We can now increase the number of samples and repeat for each possible outcome.

We increase the scale to reduce the impact of one unlikely sample. Here is the same chart again but with a thousand samples.

Bar chart showing the occurrence of differences between actual and predicted across one thousand samples of one hundred World Cup matches each.

Now that is a pretty distribution, except for that bar at -1, but we will allow it.

Let’s have a look at some other ones:

Bar charts showing the occurrence of differences between actual and predicted across one thousand samples, one chart per outcome. Note that not all outcomes are shown.

A mean below zero means the outcome happened less often than we predicted, and a mean above zero means it happened more often than we predicted.

Calibrate

We can now use these values to calibrate our pdfs. Note that a shift by a single value only moves a pdf up or down; it does not change the shape of the line.

Each sample contains 100 matches, so a mean of +1.1 means the pdf should be 1.1 percentage points higher. In other words, we move the line up by 0.011.
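As a small sketch, assuming the pdf is stored as an array of probabilities over the rating-difference grid from the earlier example, the shift comes down to a single addition:

```python
SAMPLE_SIZE = 100  # matches per sample

def calibrate(pdf: np.ndarray, mean_difference: float) -> np.ndarray:
    """Shift the whole pdf by the mean per-match difference (maximum calibration)."""
    shift = mean_difference / SAMPLE_SIZE   # e.g. +1.1 / 100 = +0.011
    return np.clip(pdf + shift, 0.0, 1.0)   # keep values in [0, 1] as a safeguard

# +1.1 is the example value from the text; each outcome gets its own mean difference.
pdf_1_0_calibrated = calibrate(pdfs["1-0"], mean_difference=1.1)
```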

The original and calibrated version of the pdf for an outcome of 1–0.

We can now repeat the process for all other lines; below are the six most common outcomes.

The original and calibrated version of the pdf for six different outcomes. Original lines in grey, and calibrated lines in color.

The above graph shows the original lines in grey and the calibrated lines in color. The lines for 1–0 and 0–1 were already dominating the graph and have only slightly increased their lead, so for our predictions, not much will change.

In this example, we chose maximum calibration: we adjusted for the entire difference. Note that we can also choose any amount in between: if we shift less, we stay closer to the original distribution; if we shift more, we move closer to the World Cup distribution.

In most situations, you will find the optimum is somewhere in the middle.
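A sketch of such partial calibration, with a blend factor `alpha` (a name chosen here for illustration) between the original pdf and the fully calibrated one:

```python
def calibrate_partial(pdf: np.ndarray, mean_difference: float, alpha: float) -> np.ndarray:
    """alpha=0 keeps the original pdf, alpha=1 applies the full World Cup shift."""
    return np.clip(pdf + alpha * mean_difference / SAMPLE_SIZE, 0.0, 1.0)

# Halfway between the all-matches distribution and the World Cup distribution
pdf_1_0_half = calibrate_partial(pdfs["1-0"], mean_difference=1.1, alpha=0.5)
```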

Wrap up

In this blog, we calibrated our probability density functions by comparing them with a target group. We sampled from World Cup games after 1992 to see if these samples aligned with the probabilities. We found that most lines are already pretty close, but can be shifted a bit to get closer to our target group.

New predictions

As always, we will generate a new set of predictions.

We will use the same Expected Value calculations we used in blog 4. The only difference is the underlying pdfs, which are now calibrated to World Cup games.

Pred. EV: Predictions generated with pdf and expected value calculations.
Pred. cal. EV: Predictions generated with the calibrated pdf and expected value calculations.

For only one match was the difference large enough to change the prediction; let's hope it actually turns out to be 1–0.

The sum of our expected values increased from 206 to 207, so we should expect to get one more point.

Next up

Update on the performance versus other players here.

Predictions for the round of 16 can be found here.

Want to get my best predictions?

Follow me here on Medium and here on LinkedIn.
