Methodology · Last updated June 8, 2026

How we know we're calibrated.

Every probability we publish is back-tested on real tournaments before it goes live. No tipster gut feel, no marketing math — two engines, walk-forward validated, source code and parameters public.

ECE per tournament · 7 historical validations · lower is better

01
Two engines

Different data shapes, identical calibration discipline.

National team football and club football have radically different fixture density. We use two engines on purpose, so neither suffers from the other's blind spot.

World Cup engine — analytical Elo + Poisson

intl_match_log               32,101 matches · 1990 → present
  ↓  schema validation · QC checks
elo                          FIFA-style ratings · K per competition tier
  ↓                          Dixon-Coles low-score correction
poisson_lambdas              bivariate Poisson · home_advantage 55.69
  ↓                          calibration_method = "none" (Phase E)
probability_output           1X2 + O/U 2.5 + Monte Carlo bracket

Each match probability is computed analytically from the two team Elo ratings and a Dixon-Coles–corrected bivariate Poisson goal model. No gradient boosting, no neural classifier on top — the World Cup dataset is too sparse (31–64 fixtures per tournament) for tree ensembles to behave.
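
The analytical step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production code: it takes the two goal rates as given (the Elo-to-lambda mapping is not shown here), uses independent Poisson marginals with the Dixon-Coles low-score factor rather than a full bivariate Poisson, and follows the sign convention of the 1997 paper, which may differ from the engine's. The function names and the sample inputs (1.6, 1.1, rho = -0.1) are hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals under a Poisson rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def dc_tau(x: int, y: int, lam_h: float, lam_a: float, rho: float) -> float:
    """Dixon-Coles (1997) factor; only 0-0, 0-1, 1-0 and 1-1 are adjusted."""
    if x == 0 and y == 0:
        return 1.0 - lam_h * lam_a * rho
    if x == 0 and y == 1:
        return 1.0 + lam_h * rho
    if x == 1 and y == 0:
        return 1.0 + lam_a * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def outcome_probs(lam_h: float, lam_a: float, rho: float, max_goals: int = 10) -> dict:
    """1X2 and Over 2.5 probabilities from a truncated score grid."""
    home = draw = away = over = total = 0.0
    for x in range(max_goals + 1):        # x = home goals
        for y in range(max_goals + 1):    # y = away goals
            p = (dc_tau(x, y, lam_h, lam_a, rho)
                 * poisson_pmf(x, lam_h) * poisson_pmf(y, lam_a))
            total += p
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
            if x + y > 2:                 # Over 2.5 means three goals or more
                over += p
    # Renormalise to absorb the small mass lost by truncating the grid.
    return {"home": home / total, "draw": draw / total,
            "away": away / total, "over_2_5": over / total}


# Illustrative inputs only, not the engine's fitted values.
probs = outcome_probs(1.6, 1.1, rho=-0.1)
```

Under this sign convention a negative rho moves probability mass from 1-0 and 0-1 towards 0-0 and 1-1, which raises the draw probability relative to the uncorrected model.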

Club engine — LightGBM + Optuna

matches.csv                  ~8,530 matches · Top-5 European leagues
  ↓  feature engineering · Elo + form + rest + venue
lightgbm                     1X2 + O/U 2.5 multinomial heads
  ↓  Optuna-tuned · walk-forward · 5 expanding folds
probability_output           1X2 distribution · O/U 2.5 distribution

For club matches, where we have orders of magnitude more fixtures, a gradient-boosted classifier shines. It is tuned with Optuna and validated walk-forward across five expanding folds. Production ECE: 0.0281 on 1X2 and 0.0091 on Over/Under 2.5 — the latter sits in the top 0.5% globally for that market.
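
A rough sketch of what "five expanding folds" can look like in Python is given here. It assumes the matches are already sorted by kickoff date and cuts them into equal index blocks; the actual fold boundaries (for example, by season) are not stated here, so the block sizing and the function name `expanding_folds` are hypothetical.

```python
from typing import Iterator, List, Tuple


def expanding_folds(n_samples: int, n_folds: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs for rows sorted by kickoff date.

    Each fold trains on everything before its test block, so the training
    window grows while the test block always lies strictly in the future.
    """
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = block * k
        # The last fold absorbs the remainder so no row is dropped.
        test_end = block * (k + 1) if k < n_folds else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))


# 8,530 is the approximate club dataset size quoted above.
folds = list(expanding_folds(8530, n_folds=5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

The model would be re-fitted on each `train_idx` and scored only on the matching `test_idx`, so no test match ever precedes a training match.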

One platform, two specialists.

International football and club football play by different rules. We trained separate engines for each, optimized for their data shape.

International engine

Built for the World Cup

Elo + Poisson analytical engine. Seven parameters tuned via Optuna over 100 trials. Walk-forward validated across eight World Cups and Euros (2010 to 2024), on a match log going back to 1990.

ECE 1X2        0.026
Walk-forward   8 tournaments
Dataset        32,101 matches
Determinism    3 seeds · 0% deviation
Pass CdM · $49 one-shot

Full access June 11 → July 19, 2026. 104 matches — group stage, knockout brackets, daily probability updates.

Club engine

Built for the leagues

LightGBM gradient boosting tuned via Optuna over 200 trials. Walk-forward validated across five seasons of the top 5 European leagues.

ECE 1X2        0.028
ECE O/U 2.5    0.009
Walk-forward   5 folds
Leagues        Top 5 European
Season Subscription · Coming August 2026 · $49/month

Premier League · La Liga · Serie A · Bundesliga · Ligue 1. Daily calibrated probabilities across every fixture.

Not a tipster service · We sell calibrated probabilities, not winning bets

02
Phase E tuning

100 Optuna trials × 7 analytical parameters.

The World Cup engine is parameterised by seven analytical knobs — not opaque hyper-parameters. Each one has a physical interpretation; each one is reproducible from a single seed.

home_advantage         55.69   Elo bonus for the home venue
default_k              36.14   K-factor when competition tier is unknown
decay_rate             0.00273 K-decay vs days since last fixture
shrinkage              0.428   form ↦ tournament mean shrinkage
form_weight            0.253   weight of recent form in lambda blend
dixon_coles_rho        0.114   low-score correction (positive flavour)
elo_lambda_divisor     1175.5  scaling of Elo diff into Poisson lambdas

Phase E ran an Optuna study over 100 trials, seed 137, with a three-seed reproducibility check (137 / 42 / 2026) returning 0% deviation. The study tightened the pooled 1X2 ECE from 0.029 (Day 5 baseline) to 0.026 (−10.7%) and raised the number of stable walk-forward folds from 3/8 to 5/8.

The best parameters are persisted in artifacts/optimization/best_params_phase_e_seed137.json and pointed to from best_params_active.json.
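
To make two of the parameters listed above concrete, here is a minimal sketch of a textbook Elo update that uses home_advantage and default_k. It assumes the conventional 400-point logistic scale and a plain win/draw/loss score; the engine's per-tier K-factors, K-decay and any goal-margin adjustment are not modelled, and the function names are hypothetical.

```python
HOME_ADVANTAGE = 55.69   # Elo bonus for the home venue (Phase E value quoted above)
DEFAULT_K = 36.14        # K-factor when the competition tier is unknown


def expected_home_score(r_home: float, r_away: float,
                        home_advantage: float = HOME_ADVANTAGE) -> float:
    """Expected score of the home side on the standard 400-point Elo scale (assumed)."""
    diff = r_home + home_advantage - r_away
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def update(r_home: float, r_away: float, home_score: float,
           k: float = DEFAULT_K) -> tuple:
    """Return updated (home, away) ratings.

    home_score is 1.0 for a home win, 0.5 for a draw, 0.0 for a home loss.
    """
    delta = k * (home_score - expected_home_score(r_home, r_away))
    return r_home + delta, r_away - delta


# Two equally rated sides draw: the home team was expected to do better
# than 0.5 because of the venue bonus, so it gives up a few points.
new_home, new_away = update(1800.0, 1800.0, home_score=0.5)
```

The update is zero-sum in this sketch: whatever the home side gains or loses, the away side loses or gains.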

03
Validation

Walk-forward, not k-fold.

K-fold leaks future information into training when applied to time-ordered data. We re-train in monthly windows so that, at any time T, the engine has only seen matches played before T.

The protocol is applied across eight historical tournaments — World Cups 2010, 2014, 2018 and 2022, plus Euros 2012, 2016, 2020 and 2024. Calibration metrics aggregate across the full out-of-sample sequence; we never report on a single holdout.

Coverage
8 tournaments

4 World Cups + 4 Euros · 440 matches in walk-forward sequence.

Training corpus
32,101

International matches between 1990 and 2026, sourced from data/processed/international_matches.parquet.

04
Calibration policy

Calibration enforced at the source, not via post-processing.

ECE measures the average gap between predicted and observed frequencies, weighted by bin density. We optimise for it directly through the analytical parameters above instead of bolting on a post-hoc transformer.

ECE = Σᵢ (nᵢ / N) · | acc(Bᵢ) − conf(Bᵢ) |

  Bᵢ        ith probability bin (e.g. [0.40, 0.50])
  nᵢ        number of predictions in Bᵢ
  N         total predictions
  acc(Bᵢ)   observed frequency in Bᵢ
  conf(Bᵢ)  mean predicted probability in Bᵢ
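
The formula above translates almost line for line into Python. The sketch below is a minimal version for a single binary event (for example "Over 2.5") with ten equal-width bins; the bin count, the function name and the way 1X2 probabilities would be flattened into binary events are assumptions, not details taken from the production pipeline.

```python
from typing import Sequence


def expected_calibration_error(probs: Sequence[float],
                               outcomes: Sequence[int],
                               n_bins: int = 10) -> float:
    """ECE = sum_i (n_i / N) * |acc(B_i) - conf(B_i)| over equal-width bins."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and the same length")

    # Per bin: [n_i, sum of predicted probabilities, sum of observed outcomes]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        i = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[i][0] += 1
        bins[i][1] += p
        bins[i][2] += y

    n_total = len(probs)
    ece = 0.0
    for n_i, sum_p, sum_y in bins:
        if n_i == 0:
            continue                            # empty bins contribute nothing
        conf = sum_p / n_i                      # mean predicted probability
        acc = sum_y / n_i                       # observed frequency
        ece += (n_i / n_total) * abs(acc - conf)
    return ece


# Toy data: forecasts of 0.1 never happen, forecasts of 0.9 always happen,
# so each bin is off by 0.1 and the ECE is about 0.1.
toy_ece = expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
```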

Why no isotonic, no temperature scaling

Phase D evaluated both post-hoc calibration methods on the World Cup engine. Isotonic regression overfit on 31–64-match priors per tournament (ECE +106% pooled), and global temperature scaling improved pooled ECE but degraded per-tournament ECE (+12%) — the calibration landscape varies across tournaments, and a single global temperature smudges that signal. Both were rejected and the production engine ships with calibration_method = "none". The reliability diagram below tracks observed vs predicted frequency under the raw Phase E output.
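
For readers unfamiliar with the rejected method, global temperature scaling can be sketched as follows. This is the generic technique from Guo et al. (2017) applied to log-probabilities, written as an illustration only; it is not the Phase D code, and the function name and sample probabilities are hypothetical.

```python
import math
from typing import List, Sequence


def apply_temperature(probs: Sequence[float], temperature: float) -> List[float]:
    """Rescale a probability vector with a single global temperature.

    temperature > 1 flattens the distribution, temperature < 1 sharpens it,
    and temperature == 1 leaves it unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


base = [0.60, 0.25, 0.15]                       # illustrative home / draw / away
flattened = apply_temperature(base, 2.0)
sharpened = apply_temperature(base, 0.5)
```

Because one temperature is shared by every match, it cannot flatten one tournament and sharpen another at the same time, which is the limitation described above.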

Reliability diagram · O/U 2.5 · ECE 0.023 · x-axis: predicted probability · y-axis: observed frequency (both 0.00 to 1.00)

Reported metrics

Engine                      Market    ECE      Brier   Log-loss
National teams · Phase E    1X2       0.026    0.21    1.00
National teams · Phase E    O/U 2.5   0.023    0.18    0.97
Top-5 leagues · LightGBM    1X2       0.0281   0.20    0.99
Top-5 leagues · LightGBM    O/U 2.5   0.0091   0.17    0.95
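
For reference, the two proper scoring rules reported alongside ECE can be computed as in the sketch below. It is a generic implementation, not the reporting pipeline: in particular, the Brier score here sums squared errors over all classes, whereas the normalisation behind the reported figures (for example, averaging per class) is not stated, so the numbers are not directly comparable. Function names are hypothetical.

```python
import math
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Mean over matches of the squared error summed across classes."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(probs)


def log_loss(probs: Sequence[Sequence[float]], outcomes: Sequence[int],
             eps: float = 1e-15) -> float:
    """Mean negative log-probability assigned to the observed class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, outcomes)) / len(probs)


# An uninformed 1X2 forecast (one third each) is a useful baseline:
# its log-loss is ln(3), roughly 1.0986.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
results = [0, 1, 2]                             # index of the observed outcome
baseline_brier = brier_score(uniform, results)
baseline_log_loss = log_loss(uniform, results)
```
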
05
Tournament simulation

100,000 Monte Carlo iterations per pre-tournament snapshot.

Tournament-stage probabilities (group advance, R16, QF, SF, final, champion) are sampled from the analytical match model. Seed and parameters are versioned alongside the artefacts.

The simulator respects the official FIFA 2026 bracket structure for the new 48-team format (12 groups of 4, top-2 advance + 8 best-third-placed, Round of 32). All 104 fixtures (72 group-stage + 32 knockout) are pre-resolved per Monte Carlo run; the public API exposes the latest snapshot at /api/public/worldcup/probabilities.

The most recent production snapshot was generated on 2026-06-01 with seed 42 and 100,000 iterations. The home page consumes that endpoint directly — none of the figures shown across the site are hardcoded in the front-end anymore.
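
The idea of a seeded, versioned simulation can be illustrated with a toy bracket. The sketch below is deliberately simplified and is not the production simulator: it plays a four-team single-elimination bracket, derives win probabilities from hypothetical ratings with a plain Elo logistic curve instead of the Poisson match model, and ignores groups, extra time and penalties. Team names, ratings and function names are made up.

```python
import random
from collections import Counter
from typing import Dict, List


def win_prob(r_a: float, r_b: float) -> float:
    """Chance that A eliminates B (plain Elo logistic, an assumption)."""
    return 1.0 / (1.0 + 10.0 ** (-(r_a - r_b) / 400.0))


def simulate_bracket(ratings: Dict[str, float], bracket: List[str],
                     rng: random.Random) -> str:
    """Play one single-elimination bracket; its size must be a power of two."""
    alive = list(bracket)
    while len(alive) > 1:
        next_round = []
        for a, b in zip(alive[0::2], alive[1::2]):   # adjacent teams meet
            winner = a if rng.random() < win_prob(ratings[a], ratings[b]) else b
            next_round.append(winner)
        alive = next_round
    return alive[0]


def champion_probabilities(ratings: Dict[str, float], bracket: List[str],
                           iterations: int, seed: int) -> Dict[str, float]:
    """Share of simulated tournaments won by each team, for a fixed seed."""
    rng = random.Random(seed)                         # dedicated, reproducible generator
    wins = Counter(simulate_bracket(ratings, bracket, rng) for _ in range(iterations))
    return {team: wins[team] / iterations for team in bracket}


ratings = {"A": 1900.0, "B": 1800.0, "C": 1750.0, "D": 1700.0}
probs = champion_probabilities(ratings, ["A", "D", "B", "C"],
                               iterations=20_000, seed=42)
```

Fixing the seed makes the run repeatable: calling `champion_probabilities` again with the same arguments returns the same dictionary.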

06
Retrospective validation

Pre-tournament probabilities, recomputed under Phase E.

As a check that the engine isn't simply fitted to the tournaments it forecasts, we recompute pre-tournament probabilities for past events under the current Phase E parameters and compare them to the actual outcome.

Tournament forecast                                Pre-tournament probability   Outcome
Argentina to win World Cup 2022                    25.4%                        Champion
France to reach the final · World Cup 2022         14.0%                        Finalist
Morocco to reach the semifinal · World Cup 2022    3.0%                         Semifinalist (4th)
Spain to win Euro 2024                             14.5%                        Champion

Reproducibility · 100,000 Monte Carlo iterations · seed 137 · pipelines/run_retrospective_proof_points.py

07
Limitations

Where the model is weaker, stated plainly.

A calibrated probability is still wrong sometimes — that's the definition. Below, the contexts where it is wrong more often than average.

01

Lineup uncertainty

National team lineups are announced late. Probabilities reflect expected squads; significant rotations are not always anticipated.

02

Closing-line gap

Sharp closing odds remain better calibrated than ours on average. We provide a structured pre-match read, not a market replacement.

03

Knockout variance

Single-leg knockouts have higher variance than group stages. Confidence indicators reflect this, but tail outcomes remain harder to forecast.

04

Manager changes

A national team head-coach change within the prior eight weeks reduces feature reliability. The confidence indicator drops accordingly.

05

Debutant teams

First-time World Cup participants have sparse international tournament history. Wider posterior intervals are applied.

06

Markets we do not cover

Player props, Asian handicap, draw-no-bet derivatives, in-play markets. The engine is a pre-match instrument.

08
References

Selected reading.

Dixon & Coles (1997) · Modelling Association Football Scores and Inefficiencies in the Football Betting Market · JRSS C
Karlis & Ntzoufras (2003) · Analysis of Sports Data by Using Bivariate Poisson Models · JRSS D
Constantinou & Fenton (2012) · Solving the Problem of Inadequate Scoring Rules for Football Forecasting · IJF
Akiba et al. (2019) · Optuna: A Next-generation Hyperparameter Optimization Framework · KDD
Guo et al. (2017) · On Calibration of Modern Neural Networks · ICML
Niculescu-Mizil & Caruana (2005) · Predicting Good Probabilities With Supervised Learning · ICML

Honest filter

Be informed before you decide.

We're not a tipster service. Here's what calibrated football analytics actually means — and what it doesn't.

Not a tipster service

We sell calibrated probabilities, not winning bets. No system promises ROI.

Markets are smart

Closing market odds remain better calibrated than ours on average. We focus on transparency, not edge.

Past ≠ future

Past performance does not guarantee future results. A historical ECE of 0.026 does not mean every prediction is right.

Squad rotations exist

World Cup matches involve fitness uncertainty, rotations, and emotional factors that are harder to model than in club football.

Code · JSON outputs · Methodology · All public