
MRB-648 Scorecards for evalml #145

Merged
dnerini merged 32 commits into main from MRB-648-Scorecards-for-evalml
May 28, 2026

adestefani8 (Collaborator) commented May 1, 2026

What this PR adds

This PR adds a new report_scorecard rule that renders a PNG comparing one run against one baseline.

The scorecard has:

  • one row per (variable × metric)
  • one column block per region
  • one column per lead time inside each region block

Each cell encodes the row's metric as a relative difference between the two runs for that region and lead time: (model − baseline) / |baseline| × 100.
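The cell value can be sketched in a few lines of plain Python. This is a minimal illustration of the formula above, not the code from the PR: the function name `relative_diff_pct` is hypothetical, and returning NaN for a zero or non-finite baseline is an assumption made here so that such cells can be treated as missing.

```python
import math


def relative_diff_pct(model, baseline):
    """Return (model - baseline) / |baseline| * 100, or NaN when undefined.

    Assumption: a zero baseline or any non-finite input yields NaN, so the
    cell can later be drawn as "missing or non-finite".
    """
    if not (math.isfinite(model) and math.isfinite(baseline)):
        return math.nan
    if baseline == 0:
        return math.nan
    return (model - baseline) / abs(baseline) * 100.0


# A model value of 1.8 against a baseline of 2.0 is roughly a -10% difference.
print(round(relative_diff_pct(1.8, 2.0), 6))
```

Dividing by the absolute value of the baseline keeps the sign of the result tied to the sign of (model − baseline), even for scores such as R2 that can be negative.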

Markers:

  • blue → model better
  • red → baseline better
  • grey → |diff| below the neutral threshold (default 5%)
  • grey x → missing or non-finite value

Scores:

  • Supported scores: RMSE, MAE, STDE, R2, ETS, POD, FAR
  • Score direction: RMSE, MAE, STDE, and FAR are lower-is-better; R2, ETS, and POD are higher-is-better.
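Combining the marker rules with the score directions, the colour choice could look roughly like the sketch below. The helper names (`base_score`, `marker_for`) and the returned strings are hypothetical, and stripping everything after the first underscore to find the score family (for example `ETS_gt_0p001` to `ETS`) is an assumption about how categorical metric names are handled.

```python
import math

LOWER_IS_BETTER = {"RMSE", "MAE", "STDE", "FAR"}
HIGHER_IS_BETTER = {"R2", "ETS", "POD"}


def base_score(metric):
    """Map a metric name such as 'ETS_gt_0p001' to its score family 'ETS'."""
    return metric.split("_", 1)[0]


def marker_for(metric, diff_pct, neutral_pct=5.0):
    """Pick a marker for one cell from its relative difference in percent."""
    if not math.isfinite(diff_pct):
        return "grey x"  # missing or non-finite value
    if abs(diff_pct) < neutral_pct:
        return "grey"  # below the neutral threshold
    score = base_score(metric)
    if score in LOWER_IS_BETTER:
        model_better = diff_pct < 0
    elif score in HIGHER_IS_BETTER:
        model_better = diff_pct > 0
    else:
        raise ValueError(f"unsupported score: {metric}")
    return "blue" if model_better else "red"


print(marker_for("RMSE", -10.0))  # lower RMSE than the baseline
print(marker_for("POD", -10.0))   # lower POD than the baseline
```

With this reading, the same negative difference is an improvement for RMSE but a degradation for POD, which is why the direction table is needed before colouring.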

Above the neutral threshold, dot area scales linearly with |diff|% and caps at size_cap_pct (default 30%).
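As an illustration of the size rule, the sketch below maps |diff|% to a dot area that grows linearly and stops growing at `size_cap_pct`. The function name `dot_area` and the `max_area` parameter are hypothetical; the area unit is whatever the plotting call expects (for example, the `s` argument of matplotlib's `scatter` is in points squared). How neutral cells are sized is not covered here.

```python
def dot_area(diff_pct, size_cap_pct=30.0, max_area=120.0):
    """Scale dot area linearly with |diff_pct|, capped at size_cap_pct.

    Assumption: max_area is the area drawn at (or beyond) the cap.
    """
    clipped = min(abs(diff_pct), size_cap_pct)
    return max_area * clipped / size_cap_pct


for diff in (10.0, 15.0, 30.0, 45.0):
    print(diff, dot_area(diff))
```

Because of the cap, a 45% difference is drawn with the same area as a 30% one, so a single extreme cell cannot dominate the figure.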

[Image: scorecard_ICON-CH2-EPS_demo2]

Configuration

Configurable via params on the rule:

  • lead_times: "start/stop/step" in hours.
  • regions: regions to include as column blocks. If no region is specified, all regions are included.
  • variables: "VAR:M1,M2,..." entries. Omit :M1,M2,... to use all_metrics for that variable.
    Metric names can also expand by prefix: requesting ETS includes all matching categorical scores, such as ETS_gt_0p0, ETS_gt_0p001, ETS_gt_0p005.
    If no variable is specified, the script falls back to RMSE only for a default set of variables.
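The parameter formats above could be parsed along the following lines. This is a sketch under stated assumptions, not the script's actual parser: all function names are hypothetical, `stop` is assumed to be inclusive, each variable is assumed to arrive as a separate `"VAR:M1,M2"` string, and the variable names in the usage lines are only illustrative.

```python
def parse_lead_times(spec):
    """Parse 'start/stop/step' (hours) into a list; stop is assumed inclusive."""
    start, stop, step = (int(part) for part in spec.split("/"))
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(start, stop + 1, step))


def parse_variable(entry, all_metrics):
    """Split 'VAR:M1,M2' into (VAR, [M1, M2]); a bare 'VAR' gets all_metrics."""
    var, sep, metrics = entry.partition(":")
    if not sep:
        return var.strip(), list(all_metrics)
    requested = [m.strip() for m in metrics.split(",") if m.strip()]
    return var.strip(), requested


def expand_metrics(requested, available):
    """Expand each requested name to exact or prefix matches, keeping order."""
    expanded = []
    for name in requested:
        for metric in available:
            matches = metric == name or metric.startswith(name + "_")
            if matches and metric not in expanded:
                expanded.append(metric)
    return expanded


available = ["RMSE", "MAE", "ETS_gt_0p0", "ETS_gt_0p001", "ETS_gt_0p005"]
print(parse_lead_times("0/24/6"))
var, requested = parse_variable("TOT_PREC:RMSE,ETS", all_metrics=available)
print(var, expand_metrics(requested, available))
```

Matching on `name + "_"` rather than on the bare prefix avoids accidental hits between unrelated scores whose names merely start with the same letters.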

Other defaults (season, init_hour, metric settings, plot styling) live in the script's cfg.

Plot layout

The plotting script makes a few automatic layout decisions:

  • the longest region label is measured before rendering: col_width grows when necessary to prevent region header overlap, and the top margin/vertical separators adapt to the rendered header height
  • the longest metric label is measured before rendering: variable labels keep a fixed gap from metric labels, and horizontal group separators start from the measured metric-label area
  • the legend is centered on the scorecard area
  • the no-data legend entry only appears when missing values are present
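Two of these decisions can be sketched without any plotting code. The example below assumes the label width has already been measured in the same units as `col_width`, and that a region header must fit inside its block of lead-time columns; the names `fit_col_width`, `legend_entries`, and the `padding` parameter are hypothetical and do not come from the script.

```python
import math


def fit_col_width(col_width, n_lead_times, label_width, padding=0.2):
    """Grow col_width only if the widest region label would not fit its block.

    Assumption: a region block is n_lead_times columns wide, and padding is
    the free space kept around the header.
    """
    required = (label_width + padding) / n_lead_times
    return max(col_width, required)


def legend_entries(diffs):
    """Build legend labels; add the no-data entry only if a value is missing."""
    entries = ["model better", "baseline better", "neutral"]
    if any(not math.isfinite(d) for d in diffs):
        entries.append("no data")
    return entries


print(fit_col_width(0.3, n_lead_times=5, label_width=1.0))  # label already fits
print(fit_col_width(0.3, n_lead_times=2, label_width=1.0))  # column must grow
print(legend_entries([4.2, float("nan"), -12.0]))
```

Taking the maximum means the configured `col_width` acts as a lower bound: short labels leave the layout unchanged, and only long labels widen the columns.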

TODOs

@dnerini dnerini marked this pull request as ready for review May 6, 2026 11:23
@dnerini dnerini requested review from dnerini and teobuz May 6, 2026 11:24
@dnerini dnerini requested review from frazane and jonasbhend May 6, 2026 15:56
dnerini (Member) commented May 6, 2026

looking good :)

[Image]

@dnerini dnerini requested review from Louis-Frey May 6, 2026 16:01
jonasbhend (Contributor) left a comment

Very nice. I really like the scorecards. Great work.

For future PRs, could you please add a short description of the changes (high-level overview) and - if necessary - also of the goals of the PR? That would be very helpful for the review.

As an additional suggestion, could we include the scorecard in the dashboard (I know we don't always want to produce it, but in case it is available it would be nice to include in a separate tab)?

Comment thread workflow/rules/report.smk Outdated
Comment thread workflow/Snakefile Outdated
Comment thread workflow/rules/report.smk Outdated
Comment thread workflow/scripts/report_scorecard.mo.py Outdated
Comment thread workflow/scripts/report_scorecard.mo.py Outdated
@dnerini dnerini self-assigned this May 26, 2026
@dnerini dnerini requested review from jonasbhend and removed request for dnerini May 26, 2026 07:18
jonasbhend (Contributor) left a comment

I have just left two minor comments. Seems all good to me otherwise.

Comment thread workflow/Snakefile Outdated
Comment thread workflow/rules/report.smk Outdated
Comment thread workflow/scripts/report_scorecard.mo.py Outdated
Comment thread workflow/Snakefile Outdated
Comment thread src/evalml/config.py
Comment thread workflow/scripts/report_scorecard.py Outdated
Comment thread workflow/scripts/report_scorecard.py Outdated
dnerini (Member) commented May 28, 2026

let's merge this, thank you all for all the precious feedback and comments!

@dnerini dnerini merged commit a288334 into main May 28, 2026
4 of 5 checks passed
@dnerini dnerini deleted the MRB-648-Scorecards-for-evalml branch May 28, 2026 11:21

3 participants