Human-Centric Chess Upset Prediction Using Mid-Game Data Mining Features
PlyShock is a Data Mining / Machine Learning project that predicts whether a chess game is likely to become an upset, where an upset is defined as a lower-rated player defeating a higher-rated opponent.
The project uses real Lichess PGN .zst data, extracts mid-game snapshots, evaluates positions with Stockfish, engineers rating/clock/engine/instability features, and compares classical machine learning models such as Decision Tree, KNN, Naive Bayes, SVM, and Random Forest.
Dynamic Upset Prediction in Chess Using Mid-Game Human-Centric Features
Chess results are strongly influenced by player ratings, but lower-rated players still defeat higher-rated opponents under certain dynamic game conditions. These upsets may be linked to mid-game factors such as:
- rating mismatch
- clock pressure
- engine evaluation
- positional instability
- evaluation swings
- interaction effects between rating, time, and position
PlyShock studies whether these mid-game signals can help predict upset outcomes more effectively than rating-only baselines.
This project is designed as a classical Data Mining project, not a deep learning chess engine.
Implemented models:
- Decision Tree
- K-Nearest Neighbors
- Naive Bayes
- Support Vector Machine
- Random Forest
The project focuses on interpretable tabular features and does not use CNN or deep-learning-based board modeling in its main implementation.
The dataset is built from public Lichess standard rated PGN dumps in .pgn.zst format.
The raw .pgn.zst file is not committed to this repository because of its large size.
A game is accepted only if it satisfies the project rules:
- decisive result only: `1-0` or `0-1`
- valid `WhiteElo` and `BlackElo`
- rating gap of at least 100
- valid time control
- available clock comments
- reaches at least move 15
- parseable legal move sequence
The target label is:
upset_label = 1 if lower-rated player wins
upset_label = 0 if higher-rated player wins
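As a rough illustration (not the project's actual parser code), the acceptance check and label can be expressed with python-chess along these lines; the time-control and clock-comment checks are omitted, and `label_game` is a hypothetical helper:

```python
import chess.pgn

MIN_RATING_GAP = 100
MIN_MOVES = 15

def label_game(game: chess.pgn.Game) -> int | None:
    """Return upset_label for an accepted game, or None if the game is filtered out."""
    headers = game.headers
    result = headers.get("Result")
    if result not in ("1-0", "0-1"):       # decisive results only
        return None
    try:
        white_elo = int(headers["WhiteElo"])
        black_elo = int(headers["BlackElo"])
    except (KeyError, ValueError):          # both ratings must be present and valid
        return None
    if abs(white_elo - black_elo) < MIN_RATING_GAP:
        return None
    if game.end().ply() < 2 * MIN_MOVES:    # must reach at least move 15
        return None
    lower_is_white = white_elo < black_elo
    lower_won = (result == "1-0") == lower_is_white
    return 1 if lower_won else 0
```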
Raw Lichess .pgn.zst
↓
Streaming PGN parser
↓
Game filtering + upset label creation
↓
Mid-game snapshot extraction
↓
Stockfish evaluation
↓
Feature engineering
↓
EDA + model training
↓
Evaluation + ablation study
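The streaming-parser step can be sketched with the zstandard and python-chess libraries listed in the tech stack; `iter_games` and the 50k cap are illustrative names, not the project's actual implementation:

```python
import io

import chess.pgn
import zstandard

def iter_games(pgn_zst_path: str, max_games: int = 50_000):
    """Stream games out of a compressed Lichess dump without decompressing it to disk."""
    with open(pgn_zst_path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        text = io.TextIOWrapper(reader, encoding="utf-8")
        for _ in range(max_games):
            game = chess.pgn.read_game(text)
            if game is None:  # end of dump
                break
            yield game
```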
For each accepted game, snapshots are extracted at:
Move 15, 20, 25, 30, 35
This creates multiple snapshot rows per game while preserving the same final upset label. The train/test split is grouped by game_id so that snapshots from the same game do not appear in both training and testing sets.
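A grouped split like this is commonly done with scikit-learn's GroupShuffleSplit. The sketch below assumes a snapshot DataFrame `df` with a `game_id` column, an `upset_label` column, and a `feature_cols` list (names are assumptions, and the 20% test size is chosen to match the reported row counts):

```python
from sklearn.model_selection import GroupShuffleSplit

# df: one row per snapshot, with the model features, upset_label, and game_id.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(df[feature_cols], df["upset_label"], groups=df["game_id"])
)

train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
# Snapshots from a single game end up entirely in train or entirely in test.
assert set(train_df["game_id"]).isdisjoint(set(test_df["game_id"]))
```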
The final feature dataset contains 26 model input features grouped into the following families.
Rating features:
- rating gap
- lower-rated player color

Clock features:
- lower-rated player remaining time
- higher-rated player remaining time
- clock difference
- clock ratios
- time-pressure flags

Engine evaluation features:
- Stockfish centipawn evaluation
- evaluation from lower-rated player perspective
- absolute evaluation
- lower-rated player better/worse flag
- mate flag

Instability features:
- evaluation delta from previous snapshot
- trend from first snapshot
- volatility so far
- large evaluation swing flag

Interaction features:
- rating gap × lower-rated evaluation
- time pressure × evaluation volatility
- lower-rated player worse while higher-rated player is under pressure
Leakage-prone fields such as result, winner color, final fullmove number, FEN, game ID, and time-control string are not used as model inputs.
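To make the feature families concrete, here is a minimal pandas sketch of a few representative columns; the input column names (`white_elo`, `black_elo`, `eval_cp`, `white_clock`, `black_clock`) are assumptions rather than the project's actual schema:

```python
import numpy as np
import pandas as pd

def add_example_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative versions of a few feature families; column names are assumed."""
    out = df.copy()
    out["rating_gap"] = (out["white_elo"] - out["black_elo"]).abs()
    lower_is_white = out["white_elo"] < out["black_elo"]
    # Engine evaluation re-signed so that positive always favours the lower-rated player.
    out["eval_for_lower"] = np.where(lower_is_white, out["eval_cp"], -out["eval_cp"])
    out["lower_is_better"] = (out["eval_for_lower"] > 0).astype(int)
    # Clock difference from the lower-rated player's perspective (remaining seconds).
    out["clock_diff"] = np.where(
        lower_is_white,
        out["white_clock"] - out["black_clock"],
        out["black_clock"] - out["white_clock"],
    )
    # Interaction term: rating mismatch scaled by the lower-rated player's evaluation.
    out["gap_x_eval_for_lower"] = out["rating_gap"] * out["eval_for_lower"]
    return out
```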
The 50k parsed-game sample produced:
Filtered games: 6,537
Upset games: 2,012
Non-upset games: 4,525
Upset rate: 30.78%
Average rating gap: 224.50
Upset rate decreased as the rating gap increased:
| Rating Gap Bucket | Upset Rate |
|---|---|
| 100–199 | 36.46% |
| 200–399 | 24.47% |
| 400–799 | 14.19% |
| 800+ | 5.56% |
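The bucketed upset rates above can be recomputed from the filtered-games table with a simple pandas cut/groupby; `games`, `rating_gap`, and `upset_label` are assumed variable and column names:

```python
import pandas as pd

# games: one row per filtered game, with rating_gap and upset_label columns (assumed names).
buckets = pd.cut(
    games["rating_gap"],
    bins=[100, 200, 400, 800, float("inf")],
    labels=["100-199", "200-399", "400-799", "800+"],
    right=False,
)
upset_rate = games.groupby(buckets, observed=True)["upset_label"].mean().mul(100).round(2)
print(upset_rate)
```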
From the 6,537 filtered games, the snapshot builder produced:
Total snapshot rows: 24,563
Move 15 snapshots: 6,537
Move 20 snapshots: 5,947
Move 25 snapshots: 5,121
Move 30 snapshots: 4,031
Move 35 snapshots: 2,927
Every snapshot was evaluated using Stockfish at fixed depth 8.
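A fixed-depth evaluation like this is typically done through python-chess's UCI engine wrapper. The helper below is a sketch, not the project's evaluation pipeline; it reuses one engine process for all snapshots and leaves forced-mate scores as `None`:

```python
import chess
import chess.engine

def eval_snapshots(fens: list[str], engine_path: str, depth: int = 8) -> list[int | None]:
    """Evaluate snapshot FENs at a fixed depth with a single Stockfish process."""
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    scores: list[int | None] = []
    try:
        for fen in fens:
            info = engine.analyse(chess.Board(fen), chess.engine.Limit(depth=depth))
            # Centipawns from White's point of view; None when the score is a mate score.
            scores.append(info["score"].white().score())
    finally:
        engine.quit()
    return scores
```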
Models were trained on 24,563 snapshot rows using a grouped train/test split by game_id.
Train rows: 19,648
Test rows: 4,915
Train games: 5,229
Test games: 1,308
Feature count: 26
| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Decision Tree | 0.698 | 0.542 | 0.500 | 0.520 | 0.647 |
| KNN | 0.729 | 0.607 | 0.477 | 0.535 | 0.729 |
| Naive Bayes | 0.707 | 0.559 | 0.482 | 0.517 | 0.727 |
| SVM | 0.724 | 0.559 | 0.733 | 0.634 | 0.810 |
| Random Forest | 0.762 | 0.702 | 0.472 | 0.564 | 0.813 |
Random Forest achieved the highest accuracy, precision, and ROC-AUC, while SVM achieved the best recall and F1-score. This suggests that Random Forest is more conservative when predicting upsets, while SVM is better at catching a larger proportion of actual upset cases.
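For reference, the five classifiers can be trained and scored with scikit-learn along these lines; the hyperparameters, the scaling choices for KNN and SVM, and the `X_train`/`X_test`/`y_train`/`y_test` arrays from the grouped split are assumptions, not the project's exact configuration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    pred = model.predict(X_test)
    print(name, f1_score(y_test, pred), roc_auc_score(y_test, proba))
```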
The ablation study compares the full feature set against simpler baselines.
| Feature Set | Accuracy | F1 | ROC-AUC |
|---|---|---|---|
| Majority Baseline | 0.673 | 0.000 | N/A |
| Rating Only | 0.552 | 0.463 | 0.586 |
| Rating + Clock | 0.696 | 0.408 | 0.698 |
| Rating + Engine | 0.692 | 0.446 | 0.699 |
| Rating + Clock + Engine | 0.762 | 0.574 | 0.811 |
| Full PlyShock | 0.762 | 0.564 | 0.813 |
The strongest performance jump comes from combining rating, clock, and engine-evaluation features. This supports the central project idea that upset prediction benefits from mid-game context rather than rating alone.
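An ablation of this kind usually just re-trains the same model on column subsets. The sketch below uses Random Forest and assumes hypothetical `rating_cols`, `clock_cols`, `engine_cols`, and `all_feature_cols` lists plus the `train_df`/`test_df` split from earlier:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score

# Maps an ablation name to the columns used; the groupings are illustrative.
feature_sets = {
    "Rating Only": rating_cols,
    "Rating + Clock": rating_cols + clock_cols,
    "Rating + Engine": rating_cols + engine_cols,
    "Rating + Clock + Engine": rating_cols + clock_cols + engine_cols,
    "Full PlyShock": all_feature_cols,
}

for name, cols in feature_sets.items():
    model = RandomForestClassifier(random_state=42)
    model.fit(train_df[cols], train_df["upset_label"])
    proba = model.predict_proba(test_df[cols])[:, 1]
    pred = model.predict(test_df[cols])
    print(name,
          f1_score(test_df["upset_label"], pred),
          roc_auc_score(test_df["upset_label"], proba))
```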
PlyShock/
|-- app/
| |-- backend/
| `-- frontend/
|-- docs/
| `-- research/
| `-- literature-notes.md
|-- research/
| |-- artifacts/
| | |-- metrics/
| | `-- plots/
| |-- data/
| | |-- raw/
| | |-- interim/
| | |-- processed/
| | `-- samples/
| |-- src/
| | `-- plyshock/
| | |-- engine/
| | |-- evaluation/
| | |-- features/
| | |-- parsing/
| | |-- pipelines/
| | `-- training/
| `-- tests/
|-- docker-compose.yml
|-- pyproject.toml
|-- uv.lock
`-- README.md
- Python 3.11
- uv
- pandas
- NumPy
- scikit-learn
- python-chess
- zstandard
- Stockfish
- matplotlib
- pytest
- Ruff
Install dependencies:

```bash
uv sync
```

Run tests:

```bash
uv run pytest
```

Run linting:

```bash
uv run ruff check research/src research/tests
```

Start the backend from the repository root:

```bash
uv run uvicorn app.backend.main:app --reload
```

Start the frontend:

```bash
cd app/frontend
pnpm dev
```

Open:
- Frontend: http://localhost:3000
- Backend docs: http://127.0.0.1:8000/docs
The frontend uses NEXT_PUBLIC_API_BASE_URL and falls back to http://127.0.0.1:8000.
Build and run the full app:

```bash
docker compose build
docker compose up
```

Then open:
- Frontend: http://localhost:3000
- Backend health: http://localhost:8000/health
- Backend docs: http://localhost:8000/docs
Docker uses Linux Stockfish installed inside the backend image. Local Windows development still
uses research/tools/stockfish/stockfish.exe when STOCKFISH_PATH is not set.
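The path resolution described above can be expressed as a small helper; this is an illustrative sketch (the final system-wide `"stockfish"` fallback is an assumption), not the project's actual code:

```python
import os
import platform
from pathlib import Path

def resolve_stockfish_path() -> str:
    """Prefer STOCKFISH_PATH (set inside the Docker image); fall back to the local Windows binary."""
    env_path = os.environ.get("STOCKFISH_PATH")
    if env_path:
        return env_path
    if platform.system() == "Windows":
        return str(Path("research/tools/stockfish/stockfish.exe"))
    return "stockfish"  # assume a binary on PATH elsewhere
```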
Before running Docker, make sure these local artifacts exist:
- `research/artifacts/models/best_model.joblib`
- `research/data/processed/feature_schema_50k.json`
- `research/artifacts/metrics/model_comparison.json`
The Compose setup mounts model, schema, and metrics artifacts as read-only volumes. Raw Lichess dumps, interim datasets, trained model binaries, and local Stockfish binaries are not committed and are not baked into Docker images.
Build filtered games:

```bash
uv run python -m plyshock.pipelines.build_filtered_games \
  --input research/data/raw/lichess_db_standard_rated_2026-03.pgn.zst \
  --output research/data/interim/filtered_games_50k.parquet \
  --summary research/artifacts/reports/filter_summary_50k.json \
  --max-games 50000
```

Build snapshots:

```bash
uv run python -m plyshock.pipelines.build_snapshots \
  --input research/data/interim/filtered_games_50k.parquet \
  --output research/data/interim/snapshots_50k.parquet \
  --summary research/artifacts/reports/snapshot_summary_50k.json
```

Run engine evaluation:

```bash
uv run python -m plyshock.pipelines.run_engine_eval \
  --snapshots research/data/interim/snapshots_50k.parquet \
  --cache research/data/interim/eval_cache_50k.parquet \
  --output research/data/interim/snapshots_with_eval_50k.parquet \
  --engine-path research/tools/stockfish/stockfish.exe \
  --depth 8
```

Build features:

```bash
uv run python -m plyshock.pipelines.build_features \
  --input research/data/interim/snapshots_with_eval_50k.parquet \
  --output research/data/processed/plyshock_features_50k.parquet \
  --schema research/data/processed/feature_schema_50k.json \
  --summary research/artifacts/reports/feature_summary_50k.json
```

Train models:

```bash
uv run python -m plyshock.pipelines.train_all_models \
  --features research/data/processed/plyshock_features_50k.parquet \
  --schema research/data/processed/feature_schema_50k.json \
  --models-dir research/artifacts/models \
  --metrics-dir research/artifacts/metrics \
  --plots-dir research/artifacts/plots
```

Run ablation study:

```bash
uv run python -m plyshock.pipelines.run_ablation \
  --features research/data/processed/plyshock_features_50k.parquet \
  --schema research/data/processed/feature_schema_50k.json \
  --output-json research/artifacts/metrics/ablation_results_50k.json \
  --output-plot research/artifacts/plots/ablation_f1_50k.png
```

The following files are intentionally not committed:
- raw Lichess `.pgn.zst` dumps
- interim parquet datasets
- processed parquet datasets
- Stockfish binary
- trained `.joblib` models
- local virtual environments
- local editor settings
This keeps the repository lightweight and reproducible.
The research pipeline and local demo app are complete through:
Parser ✅
Filtering ✅
EDA ✅
Snapshot extraction ✅
Stockfish evaluation ✅
Feature engineering ✅
Model training ✅
Ablation study ✅
Evaluation plots ✅
FastAPI backend ✅
Next.js frontend ✅
Docker packaging ✅
Demo replay mode ✅
Paste/custom PGN mode ✅
Live board mode ✅
Planned next stages:
Final report writing
Presentation / viva preparation
Optional cloud deployment
The reported results are based on a sampled subset of Lichess games and should be interpreted as project-level experimental findings, not universal claims about all chess games. Further validation on additional months, larger samples, and different time controls would strengthen generalization.
Created by FuriousFire
This project is licensed under the MIT License.







