Human-Centric Chess Upset Prediction Using Mid-Game Data Mining Features
PlyShock is a Data Mining / Machine Learning project that predicts whether a chess game is likely to become an upset, where an upset is defined as a lower-rated player defeating a higher-rated opponent.
The project uses real Lichess PGN .zst data, extracts mid-game snapshots, evaluates positions with Stockfish, engineers rating/clock/engine/instability features, and compares classical machine learning models such as Decision Tree, KNN, Naive Bayes, SVM, and Random Forest.
Dynamic Upset Prediction in Chess Using Mid-Game Human-Centric Features
Chess results are strongly influenced by player ratings, but lower-rated players still defeat higher-rated opponents under certain dynamic game conditions. These upsets may be linked to mid-game factors such as:
- rating mismatch
- clock pressure
- engine evaluation
- positional instability
- evaluation swings
- interaction effects between rating, time, and position
PlyShock studies whether these mid-game signals can help predict upset outcomes more effectively than rating-only baselines.
This project is designed as a classical Data Mining project, not a deep learning chess engine.
Implemented models:
- Decision Tree
- K-Nearest Neighbors
- Naive Bayes
- Support Vector Machine
- Random Forest
The project focuses on interpretable tabular features and does not use CNN or deep-learning-based board modeling in its main implementation.
The dataset is built from public Lichess standard rated PGN dumps in .pgn.zst format.
The raw .pgn.zst file is not committed to this repository because of its large size.
A game is accepted only if it satisfies the project rules:
- decisive result only: `1-0` or `0-1`
- valid `WhiteElo` and `BlackElo`
- rating gap of at least 100
- valid time control
- available clock comments
- reaches at least move 15
- parseable legal move sequence
The target label is:
upset_label = 1 if lower-rated player wins
upset_label = 0 if higher-rated player wins
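As a rough illustration (not the project's actual parser code), the acceptance check and label can be expressed with python-chess along these lines; the time-control and clock-comment checks are omitted, and `label_game` is a hypothetical helper:

```python
import chess.pgn

MIN_RATING_GAP = 100
MIN_MOVES = 15

def label_game(game: chess.pgn.Game) -> int | None:
    """Return upset_label for an accepted game, or None if the game is filtered out."""
    headers = game.headers
    result = headers.get("Result")
    if result not in ("1-0", "0-1"):       # decisive results only
        return None
    try:
        white_elo = int(headers["WhiteElo"])
        black_elo = int(headers["BlackElo"])
    except (KeyError, ValueError):          # both ratings must be present and valid
        return None
    if abs(white_elo - black_elo) < MIN_RATING_GAP:
        return None
    if game.end().ply() < 2 * MIN_MOVES:    # must reach at least move 15
        return None
    lower_is_white = white_elo < black_elo
    lower_won = (result == "1-0") == lower_is_white
    return 1 if lower_won else 0
```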
Raw Lichess .pgn.zst
↓
Streaming PGN parser
↓
Game filtering + upset label creation
↓
Mid-game snapshot extraction
↓
Stockfish evaluation
↓
Feature engineering
↓
EDA + model training
↓
Evaluation + ablation study
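The streaming-parser step can be sketched with the zstandard and python-chess libraries listed in the tech stack; `iter_games` and the 50k cap are illustrative names, not the project's actual implementation:

```python
import io

import chess.pgn
import zstandard

def iter_games(pgn_zst_path: str, max_games: int = 50_000):
    """Stream games out of a compressed Lichess dump without decompressing it to disk."""
    with open(pgn_zst_path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        text = io.TextIOWrapper(reader, encoding="utf-8")
        for _ in range(max_games):
            game = chess.pgn.read_game(text)
            if game is None:  # end of dump
                break
            yield game
```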
For each accepted game, snapshots are extracted at:
Move 15, 20, 25, 30, 35
This creates multiple snapshot rows per game while preserving the same final upset label. The train/test split is grouped by game_id so that snapshots from the same game do not appear in both training and testing sets.
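A grouped split like this is commonly done with scikit-learn's GroupShuffleSplit. The sketch below assumes a snapshot DataFrame `df` with a `game_id` column, an `upset_label` column, and a `feature_cols` list (names are assumptions, and the 20% test size is chosen to match the reported row counts):

```python
from sklearn.model_selection import GroupShuffleSplit

# df: one row per snapshot, with the model features, upset_label, and game_id.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(df[feature_cols], df["upset_label"], groups=df["game_id"])
)

train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
# Snapshots from a single game end up entirely in train or entirely in test.
assert set(train_df["game_id"]).isdisjoint(set(test_df["game_id"]))
```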
The final feature dataset contains 26 model input features grouped into the following families.
Rating features:
- rating gap
- lower-rated player color

Clock features:
- lower-rated player remaining time
- higher-rated player remaining time
- clock difference
- clock ratios
- time-pressure flags

Engine evaluation features:
- Stockfish centipawn evaluation
- evaluation from lower-rated player perspective
- absolute evaluation
- lower-rated player better/worse flag
- mate flag

Instability features:
- evaluation delta from previous snapshot
- trend from first snapshot
- volatility so far
- large evaluation swing flag

Interaction features:
- rating gap × lower-rated evaluation
- time pressure × evaluation volatility
- lower-rated player worse while higher-rated player is under pressure
Leakage-prone fields such as result, winner color, final fullmove number, FEN, game ID, and time-control string are not used as model inputs.
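To make the feature families concrete, here is a minimal pandas sketch of a few representative columns; the input column names (`white_elo`, `black_elo`, `eval_cp`, `white_clock`, `black_clock`) are assumptions rather than the project's actual schema:

```python
import numpy as np
import pandas as pd

def add_example_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative versions of a few feature families; column names are assumed."""
    out = df.copy()
    out["rating_gap"] = (out["white_elo"] - out["black_elo"]).abs()
    lower_is_white = out["white_elo"] < out["black_elo"]
    # Engine evaluation re-signed so that positive always favours the lower-rated player.
    out["eval_for_lower"] = np.where(lower_is_white, out["eval_cp"], -out["eval_cp"])
    out["lower_is_better"] = (out["eval_for_lower"] > 0).astype(int)
    # Clock difference from the lower-rated player's perspective (remaining seconds).
    out["clock_diff"] = np.where(
        lower_is_white,
        out["white_clock"] - out["black_clock"],
        out["black_clock"] - out["white_clock"],
    )
    # Interaction term: rating mismatch scaled by the lower-rated player's evaluation.
    out["gap_x_eval_for_lower"] = out["rating_gap"] * out["eval_for_lower"]
    return out
```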
The 50k parsed-game sample produced:
Filtered games: 6,537
Upset games: 2,012
Non-upset games: 4,525
Upset rate: 30.78%
Average rating gap: 224.50
Upset rate decreased as the rating gap increased:
| Rating Gap Bucket | Upset Rate |
|---|---|
| 100–199 | 36.46% |
| 200–399 | 24.47% |
| 400–799 | 14.19% |
| 800+ | 5.56% |
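The bucketed upset rates above can be recomputed from the filtered-games table with a simple pandas cut/groupby; `games`, `rating_gap`, and `upset_label` are assumed variable and column names:

```python
import pandas as pd

# games: one row per filtered game, with rating_gap and upset_label columns (assumed names).
buckets = pd.cut(
    games["rating_gap"],
    bins=[100, 200, 400, 800, float("inf")],
    labels=["100-199", "200-399", "400-799", "800+"],
    right=False,
)
upset_rate = games.groupby(buckets, observed=True)["upset_label"].mean().mul(100).round(2)
print(upset_rate)
```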
From the 6,537 filtered games, the snapshot builder produced:
Total snapshot rows: 24,563
Move 15 snapshots: 6,537
Move 20 snapshots: 5,947
Move 25 snapshots: 5,121
Move 30 snapshots: 4,031
Move 35 snapshots: 2,927
Every snapshot was evaluated using Stockfish at fixed depth 8.
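A fixed-depth evaluation like this is typically done through python-chess's UCI engine wrapper. The helper below is a sketch, not the project's evaluation pipeline; it reuses one engine process for all snapshots and leaves forced-mate scores as `None`:

```python
import chess
import chess.engine

def eval_snapshots(fens: list[str], engine_path: str, depth: int = 8) -> list[int | None]:
    """Evaluate snapshot FENs at a fixed depth with a single Stockfish process."""
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    scores: list[int | None] = []
    try:
        for fen in fens:
            info = engine.analyse(chess.Board(fen), chess.engine.Limit(depth=depth))
            # Centipawns from White's point of view; None when the score is a mate score.
            scores.append(info["score"].white().score())
    finally:
        engine.quit()
    return scores
```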
Models were trained on 24,563 snapshot rows using a grouped train/test split by game_id.
Train rows: 19,648
Test rows: 4,915
Train games: 5,229
Test games: 1,308
Feature count: 26
| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Decision Tree | 0.698 | 0.542 | 0.500 | 0.520 | 0.647 |
| KNN | 0.729 | 0.607 | 0.477 | 0.535 | 0.729 |
| Naive Bayes | 0.707 | 0.559 | 0.482 | 0.517 | 0.727 |
| SVM | 0.724 | 0.559 | 0.733 | 0.634 | 0.810 |
| Random Forest | 0.762 | 0.702 | 0.472 | 0.564 | 0.813 |
Random Forest achieved the highest accuracy, precision, and ROC-AUC, while SVM achieved the best recall and F1-score. This suggests that Random Forest is more conservative when predicting upsets, while SVM is better at catching a larger proportion of actual upset cases.
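For reference, the five classifiers can be trained and scored with scikit-learn along these lines; the hyperparameters, the scaling choices for KNN and SVM, and the `X_train`/`X_test`/`y_train`/`y_test` arrays from the grouped split are assumptions, not the project's exact configuration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    pred = model.predict(X_test)
    print(name, f1_score(y_test, pred), roc_auc_score(y_test, proba))
```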
The ablation study compares the full feature set against simpler baselines.
| Feature Set | Accuracy | F1 | ROC-AUC |
|---|---|---|---|
| Majority Baseline | 0.673 | 0.000 | N/A |
| Rating Only | 0.552 | 0.463 | 0.586 |
| Rating + Clock | 0.696 | 0.408 | 0.698 |
| Rating + Engine | 0.692 | 0.446 | 0.699 |
| Rating + Clock + Engine | 0.762 | 0.574 | 0.811 |
| Full PlyShock | 0.762 | 0.564 | 0.813 |
The strongest performance jump comes from combining rating, clock, and engine-evaluation features. This supports the central project idea that upset prediction benefits from mid-game context rather than rating alone.
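An ablation of this kind usually just re-trains the same model on column subsets. The sketch below uses Random Forest and assumes hypothetical `rating_cols`, `clock_cols`, `engine_cols`, and `all_feature_cols` lists plus the `train_df`/`test_df` split from earlier:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score

# Maps an ablation name to the columns used; the groupings are illustrative.
feature_sets = {
    "Rating Only": rating_cols,
    "Rating + Clock": rating_cols + clock_cols,
    "Rating + Engine": rating_cols + engine_cols,
    "Rating + Clock + Engine": rating_cols + clock_cols + engine_cols,
    "Full PlyShock": all_feature_cols,
}

for name, cols in feature_sets.items():
    model = RandomForestClassifier(random_state=42)
    model.fit(train_df[cols], train_df["upset_label"])
    proba = model.predict_proba(test_df[cols])[:, 1]
    pred = model.predict(test_df[cols])
    print(name,
          f1_score(test_df["upset_label"], pred),
          roc_auc_score(test_df["upset_label"], proba))
```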
PlyShock/
|-- app/
| |-- backend/
| `-- frontend/
|-- docs/
| `-- research/
| `-- literature-notes.md
|-- research/
| |-- artifacts/
| | |-- metrics/
| | `-- plots/
| |-- data/
| | |-- raw/
| | |-- interim/
| | |-- processed/
| | `-- samples/
| |-- src/
| | `-- plyshock/
| | |-- engine/
| | |-- evaluation/
| | |-- features/
| | |-- parsing/
| | |-- pipelines/
| | `-- training/
| `-- tests/
|-- docker-compose.yml
|-- pyproject.toml
|-- uv.lock
`-- README.md
- Python 3.11
- uv
- pandas
- NumPy
- scikit-learn
- python-chess
- zstandard
- Stockfish
- matplotlib
- pytest
- Ruff
Install dependencies:

```bash
uv sync
```

Run tests:

```bash
uv run pytest
```

Run linting:

```bash
uv run ruff check research/src research/tests
```

Start the backend from the repository root:

```bash
uv run uvicorn app.backend.main:app --reload
```

Start the frontend:

```bash
cd app/frontend
pnpm dev
```

Open:
- Frontend: http://localhost:3000
- Backend docs: http://127.0.0.1:8000/docs
The frontend uses NEXT_PUBLIC_API_BASE_URL and falls back to http://127.0.0.1:8000.
Build and run the full app:

```bash
docker compose build
docker compose up
```

Then open:
- Frontend: http://localhost:3000
- Backend health: http://localhost:8000/health
- Backend docs: http://localhost:8000/docs
Docker uses Linux Stockfish installed inside the backend image. Local Windows development still
uses research/tools/stockfish/stockfish.exe when STOCKFISH_PATH is not set.
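The path resolution described above can be expressed as a small helper; this is an illustrative sketch (the final system-wide `"stockfish"` fallback is an assumption), not the project's actual code:

```python
import os
import platform
from pathlib import Path

def resolve_stockfish_path() -> str:
    """Prefer STOCKFISH_PATH (set inside the Docker image); fall back to the local Windows binary."""
    env_path = os.environ.get("STOCKFISH_PATH")
    if env_path:
        return env_path
    if platform.system() == "Windows":
        return str(Path("research/tools/stockfish/stockfish.exe"))
    return "stockfish"  # assume a binary on PATH elsewhere
```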
Before running Docker, make sure these local artifacts exist:
- `research/artifacts/models/best_model.joblib`
- `research/data/processed/feature_schema_50k.json`
- `research/artifacts/metrics/model_comparison.json`
The Compose setup mounts model, schema, and metrics artifacts as read-only volumes. Raw Lichess dumps, interim datasets, trained model binaries, and local Stockfish binaries are not committed and are not baked into Docker images.
Build filtered games:

```bash
uv run python -m plyshock.pipelines.build_filtered_games \
  --input research/data/raw/lichess_db_standard_rated_2026-03.pgn.zst \
  --output research/data/interim/filtered_games_50k.parquet \
  --summary research/artifacts/reports/filter_summary_50k.json \
  --max-games 50000
```

Build snapshots:

```bash
uv run python -m plyshock.pipelines.build_snapshots \
  --input research/data/interim/filtered_games_50k.parquet \
  --output research/data/interim/snapshots_50k.parquet \
  --summary research/artifacts/reports/snapshot_summary_50k.json
```

Run engine evaluation:

```bash
uv run python -m plyshock.pipelines.run_engine_eval \
  --snapshots research/data/interim/snapshots_50k.parquet \
  --cache research/data/interim/eval_cache_50k.parquet \
  --output research/data/interim/snapshots_with_eval_50k.parquet \
  --engine-path research/tools/stockfish/stockfish.exe \
  --depth 8
```

Build features:

```bash
uv run python -m plyshock.pipelines.build_features \
  --input research/data/interim/snapshots_with_eval_50k.parquet \
  --output research/data/processed/plyshock_features_50k.parquet \
  --schema research/data/processed/feature_schema_50k.json \
  --summary research/artifacts/reports/feature_summary_50k.json
```

Train models:

```bash
uv run python -m plyshock.pipelines.train_all_models \
  --features research/data/processed/plyshock_features_50k.parquet \
  --schema research/data/processed/feature_schema_50k.json \
  --models-dir research/artifacts/models \
  --metrics-dir research/artifacts/metrics \
  --plots-dir research/artifacts/plots
```

Run ablation study:

```bash
uv run python -m plyshock.pipelines.run_ablation \
  --features research/data/processed/plyshock_features_50k.parquet \
  --schema research/data/processed/feature_schema_50k.json \
  --output-json research/artifacts/metrics/ablation_results_50k.json \
  --output-plot research/artifacts/plots/ablation_f1_50k.png
```

The following files are intentionally not committed:
- raw Lichess `.pgn.zst` dumps
- interim parquet datasets
- processed parquet datasets
- Stockfish binary
- trained `.joblib` models
- local virtual environments
- local editor settings
This keeps the repository lightweight and reproducible.
The research pipeline and local demo app are complete through:
Parser ✅
Filtering ✅
EDA ✅
Snapshot extraction ✅
Stockfish evaluation ✅
Feature engineering ✅
Model training ✅
Ablation study ✅
Evaluation plots ✅
FastAPI backend ✅
Next.js frontend ✅
Docker packaging ✅
Demo replay mode ✅
Paste/custom PGN mode ✅
Live board mode ✅
Planned next stages:
Final report writing
Presentation / viva preparation
Optional cloud deployment
The reported results are based on a sampled subset of Lichess games and should be interpreted as project-level experimental findings, not universal claims about all chess games. Further validation on additional months, larger samples, and different time controls would strengthen generalization.
Created by FuriousFire
This project is licensed under the MIT License.







