Engine architecture, validation protocols, and limitations.
Five deterministic stages, end-to-end auditable.
Every probability traces back to a feature vector and a calibrated head. No ensembles, no opaque rerankers.
Pipeline
intl_match_log
↓ schema validation · QC checks · 14 source feeds
elo_poisson
↓ Elo ratings + bivariate Poisson goal model
lgbm-prematch-v1
↓ LightGBM classifier · Optuna-tuned · 5-fold CV
temperature_scaling
↓ post-hoc calibration on held-out validation set
probability_output
1X2 distribution · O/U 2.5 distribution
Stage 1 — intl_match_log
Match-level events, squad selections, and venue data are ingested from the source feeds. A schema validator rejects partial records; a quality-control layer enforces temporal integrity, so no post-kickoff information leaks into pre-match features.
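The two checks above can be sketched in a few lines. This is a minimal illustration, not the production validator; the field names (match_id, kickoff_utc, etc.) are assumptions.

```python
from datetime import datetime

# Illustrative schema: field names are assumed, not the engine's actual schema.
REQUIRED = {"match_id", "kickoff_utc", "home", "away", "home_goals", "away_goals"}

def validate_record(rec: dict) -> bool:
    """Schema check: reject partial records (missing or null required fields)."""
    return REQUIRED <= rec.keys() and all(rec[k] is not None for k in REQUIRED)

def no_lookahead(rec: dict, feature_asof: datetime) -> bool:
    """Temporal-integrity check: features must be computed from data
    known strictly before kickoff."""
    return feature_asof < datetime.fromisoformat(rec["kickoff_utc"])
```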
Stage 2 — elo_poisson
FIFA-style Elo ratings with variable K-factors per competition tier, combined with a bivariate Poisson model (Dixon-Coles correction for low-scoring matches). Trained on 32,101 international matches from 1990 to present.
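The two components can be sketched as follows: a bilateral Elo update, and the Dixon-Coles correction factor that adjusts the probability of low-scoring scorelines. The k and home_adv values are illustrative assumptions; the engine varies K by competition tier.

```python
def elo_update(r_home, r_away, result, k=40.0, home_adv=60.0):
    """One bilateral post-match Elo update.
    result: 1.0 home win, 0.5 draw, 0.0 home loss.
    k and home_adv are illustrative; the engine varies k per tier."""
    expected = 1.0 / (1.0 + 10 ** ((r_away - r_home - home_adv) / 400.0))
    delta = k * (result - expected)
    return r_home + delta, r_away - delta

def dixon_coles_tau(x, y, lam, mu, rho):
    """Dixon-Coles correction factor for scoreline (x, y) under home/away
    scoring rates lam and mu; rho couples only the low-scoring cells."""
    if x == 0 and y == 0: return 1.0 - lam * mu * rho
    if x == 0 and y == 1: return 1.0 + lam * rho
    if x == 1 and y == 0: return 1.0 + mu * rho
    if x == 1 and y == 1: return 1.0 - rho
    return 1.0
```

Note the zero-sum property: whatever rating the home side gains, the away side loses, so total rating mass is conserved across updates.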
Stage 3 — lgbm-prematch-v1
LightGBM gradient-boosted classifier with multinomial output. Hyperparameters tuned via Optuna over 200 trials with regularization priors (max_depth ≤ 6, min_child_samples ≥ 80) to suppress over-fitting on the long tail of low-frequency feature combinations.
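The constrained search can be pictured as a bounded sampler. This sketch uses LightGBM's parameter vocabulary but stands in for Optuna's samplers with plain random draws; only the two stated priors (max_depth ≤ 6, min_child_samples ≥ 80) come from the text, the other ranges are assumptions.

```python
import random

# Search space bounded by the regularization priors described above.
SEARCH_SPACE = {
    "max_depth": (3, 6),            # prior: max_depth <= 6
    "min_child_samples": (80, 400), # prior: min_child_samples >= 80
    "num_leaves": (15, 63),         # illustrative range
}

def sample_trial(rng: random.Random) -> dict:
    """Draw one candidate configuration, mimicking suggest_int over
    the bounded ranges; every draw respects the priors by construction."""
    return {name: rng.randint(lo, hi) for name, (lo, hi) in SEARCH_SPACE.items()}
```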
Stage 4 — temperature_scaling
Logits divided by a learned temperature T, fitted by minimizing NLL on a held-out calibration set disjoint from training. Temperature scaling preserves rank ordering while sharpening or softening the probability distribution to match observed frequencies.
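A pure-Python sketch of the fit, assuming logits as lists of floats and integer class labels. Grid search stands in for the gradient-based fit a production system would use; since T only rescales logits, argmax ordering is unchanged for any T > 0.

```python
import math

def nll(logits, labels, T):
    """Mean negative log-likelihood of labels under temperature-scaled softmax."""
    total = 0.0
    for z, y in zip(logits, labels):
        scaled = [v / T for v in z]
        m = max(scaled)  # max-subtraction for numerical stability
        log_Z = m + math.log(sum(math.exp(v - m) for v in scaled))
        total += log_Z - scaled[y]
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Fit T on a held-out calibration set by minimizing NLL over a grid."""
    grid = grid or [0.5 + 0.05 * i for i in range(61)]  # T in [0.5, 3.5]
    return min(grid, key=lambda T: nll(logits, labels, T))
```

On overconfident logits (high stated confidence, lower observed accuracy) the fitted T comes out above 1, softening the distribution toward observed frequencies.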
Stage 5 — probability_output
Final 1X2 distribution and Over/Under 2.5 distribution. Confidence indicator derived from feature density and historical calibration of the matched scenario cluster.
ECE — Expected Calibration Error.
A probabilistic forecaster is well-calibrated if outcomes labelled at p occur with frequency p. ECE measures the average gap, weighted by bin frequency.
ECE = Σᵢ (nᵢ / N) · |acc(Bᵢ) − conf(Bᵢ)|

Bᵢ: ith probability bin (e.g. [0.40, 0.50])
nᵢ: number of predictions in Bᵢ
N: total predictions
acc(Bᵢ): observed frequency in Bᵢ
conf(Bᵢ): mean predicted probability in Bᵢ
An ECE of 0.023 on Over/Under 2.5 means our probabilities deviate from observed frequencies by 2.3 percentage points on average, weighted by bin density. The reliability diagram below visualizes the same quantity: each point is a bin, the diagonal is perfect calibration.
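The formula translates directly into code. This sketch assumes equal-width bins over [0, 1] and binary outcomes (event occurred or not).

```python
def ece(probs, outcomes, n_bins=10):
    """Expected Calibration Error with equal-width bins.
    probs: predicted probability of the event; outcomes: 1 if it occurred."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in top bin
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean predicted probability
        acc = sum(y for _, y in b) / len(b)    # observed frequency
        err += (len(b) / total) * abs(acc - conf)
    return err
```

A perfectly calibrated set scores 0; predicting 0.9 for events that never occur scores 0.9.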
Reported metrics
Walk-forward, not k-fold.
K-fold leaks future information into training when applied to time-ordered data. Walk-forward enforces strict temporal separation: at week T, the model has only seen matches before T.
Our protocol re-trains in monthly windows. Each prediction is evaluated against an outcome the model could not have observed at training time. Calibration metrics are aggregated across the full out-of-sample sequence — never on a single holdout set.
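The split logic can be sketched abstractly: for each window start T, the training set is everything strictly before T, and the test set is the window [T, next start). Dates are represented here by any sortable values.

```python
def walk_forward_splits(dates, boundaries):
    """Walk-forward protocol sketch.
    dates: sorted match dates; boundaries: sorted window starts.
    At each boundary T, train only on matches strictly before T."""
    splits = []
    for i, start in enumerate(boundaries):
        end = boundaries[i + 1] if i + 1 < len(boundaries) else None
        train = [d for d in dates if d < start]
        test = [d for d in dates if d >= start and (end is None or d < end)]
        splits.append((train, test))
    return splits
```

Unlike k-fold, no test match ever precedes a training match, and the training set grows monotonically as windows advance.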
Coverage
WC 2010, 2014, 2018, 2022 + Euros 2012, 2016, 2020, 2024. 32,101 international matches in training corpus.
International matches from 1990 to present. Elo + Poisson model with walk-forward validation on 8 past tournaments.
worldcup-2026-engine-v1.
National-team football has fewer matches per side, longer gaps, and squad rotations that destabilize club-level features. Our World Cup engine uses an Elo-style rating updated bilaterally after each international fixture, combined with a bivariate Poisson goal model conditioned on team strength differential and venue.
Specifications
model: worldcup-2026-engine-v1
approach: Elo (FIFA-style) + bivariate Poisson
training data: 32,101 international matches · 1990–2026
validation: walk-forward · 8 past tournaments
calibration: ECE 0.026 (1X2) · ECE 0.023 (O/U 2.5)
qualification: Monte Carlo · 100,000 simulations
update frequency: after every match (during tournament)
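The Monte Carlo qualification step can be sketched as follows. The real engine ranks all four group teams jointly; this simplified version samples one team's group matches from their 1X2 distributions and counts runs that reach a points threshold (an illustrative stand-in for group ranking).

```python
import random

def qualification_prob(match_probs, points_needed=5, n_sims=100_000, seed=42):
    """Monte Carlo sketch: sample each group match from its (win, draw, loss)
    distribution, tally points (3/1/0), and estimate P(points >= threshold).
    The fixed threshold is a simplification; the engine ranks the full group."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        pts = 0
        for p_win, p_draw, _p_loss in match_probs:
            u = rng.random()
            pts += 3 if u < p_win else (1 if u < p_win + p_draw else 0)
        hits += pts >= points_needed
    return hits / n_sims
```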
Retrospective validation
These are pre-tournament forecasts. Once the tournament begins, probabilities update after every match.
Where the model is weaker, stated plainly.
A calibrated probability is still wrong sometimes; that is what calibration means: a 70% forecast should fail three times in ten. Below are the contexts where the model is wrong more often than average.
Lineup uncertainty
National team lineups are announced late. Probabilities reflect expected squads; significant rotations are not always anticipated.
Closing-line gap
Sharp closing odds remain better calibrated than ours on average. We provide a structured pre-match read, not a market replacement.
Knockout formats
Single-leg knockout matches have higher variance than group stages. Confidence indicators reflect this, but tail outcomes remain harder to forecast.
Manager changes
A change in national team head coach within the prior 8 weeks reduces feature reliability. The confidence indicator drops accordingly.
Debutant teams
First-time World Cup participants have sparse international tournament history. Wider posterior intervals applied.
Markets we do not cover
Player-level props, Asian handicap, draw-no-bet derivatives, and in-play markets. The engine is a pre-match instrument.