Fix examples workflow grafting config overrides#257

Closed
runame wants to merge 3 commits into facebookresearch:main from runame:fix/examples-beta2-config

Conversation

@runame commented Apr 25, 2026

Summary

The four grafting CLI overrides in .github/workflows/examples.yaml use the form
'optimizer.grafting_config={_target_:...,...}'. With OmegaConf's default merge
semantics, this merges the new mapping into the inherited
grafting_config from distributed_shampoo/examples/configs/optimizer/shampoo.yaml,
which currently defines:

```
grafting_config:
  _target_: distributed_shampoo.AdamPreconditionerConfig
  beta2: 0.999
  epsilon: 1e-8
```

So an override with `_target_: AdaGradPreconditionerConfig` ends up as
`{_target_: AdaGradPreconditionerConfig, beta2: 0.999, epsilon: 1e-08}`, and
Hydra instantiation fails with:

TypeError: AdaGradPreconditionerConfig.__init__() got an unexpected keyword argument 'beta2'

SGDPreconditionerConfig has the same latent bug (it would inherit both
beta2 and epsilon), but execution stops at the AdaGrad line. Adam
and RMSprop happen to work because their _target_ accepts beta2.
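
The merge behavior can be reproduced with OmegaConf alone. A minimal sketch
(class names abbreviated to bare strings; not code from the repo):

```python
# Reproduce the default merge semantics described above: keys of the base
# mapping that the override does not mention survive the merge.
from omegaconf import OmegaConf

base = OmegaConf.create(
    {"_target_": "AdamPreconditionerConfig", "beta2": 0.999, "epsilon": 1e-8}
)
override = OmegaConf.create({"_target_": "AdaGradPreconditionerConfig"})

merged = OmegaConf.merge(base, override)
print(merged)
# {'_target_': 'AdaGradPreconditionerConfig', 'beta2': 0.999, 'epsilon': 1e-08}
```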

This regression was introduced in commit 405622d ("Consolidate CIFAR-10
examples in Shampoo with hydra configs"), which switched the workflow
from argparse-based scripts to Hydra overrides without accounting for
dict-merge semantics.

Fix

Replace each optimizer.grafting_config={...} override with the
standard Hydra delete-then-add idiom:

```
'~optimizer.grafting_config' '+optimizer.grafting_config={...}'
```

This forces the inherited mapping to be discarded and the new mapping
inserted whole, so only the explicitly-listed keys are passed to the
target's constructor.

Verified locally with `hydra.compose` (sketched below):

- old override → `{_target_: AdaGrad, beta2: 0.999, epsilon: 1e-08}` (broken)
- new override → `{_target_: AdaGrad, epsilon: 1e-08}` (correct)
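
A sketch of that check, assuming a local `configs` directory and a top-level
`config.yaml` like the ones under `distributed_shampoo/examples/configs`
(the path and config name are illustrative):

```python
# Compose the config with the delete-then-add override and inspect the
# resulting grafting_config node.
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="config",
        overrides=[
            "~optimizer.grafting_config",
            "+optimizer.grafting_config={_target_: distributed_shampoo.AdaGradPreconditionerConfig, epsilon: 1e-8}",
        ],
    )

print(OmegaConf.to_yaml(cfg.optimizer.grafting_config))
# _target_: distributed_shampoo.AdaGradPreconditionerConfig
# epsilon: 1e-08
```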

The fix is applied to all six grafting-override invocations in the file (CPU
single-GPU loop, GPU single-GPU, DDP CPU, DDP GPU) for consistency.

Test plan

- CI examples job passes on this branch
- AdaGrad, Adam, RMSprop, and SGD grafting cases all run to completion

Notes

Stacked on top of #256 (`fix/init-stale-rename-imports`). That PR's
import fix unblocked the `examples` job enough to surface this latent
merge bug. The branch is rebased on `fix/init-stale-rename-imports`,
so until #256 merges this diff will appear to include #256's commit.
Merge order: #256 first, then this.

🤖 Generated with Claude Code

runame and others added 2 commits April 11, 2026 20:38
`AmortizedPreconditionerConfig` was renamed to
`BaseShampooPreconditionerConfig` and `ShampooPreconditionerConfig` to
`ClassicShampooPreconditionerConfig` in `shampoo_types.py`, but
`__init__.py` still imported and re-exported the old names. Replace
them with the new names in the imports and `__all__`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CLI overrides `optimizer.grafting_config={...}` use OmegaConf's
default merge semantics, so fields from the base shampoo.yaml
grafting_config (notably `beta2: 0.999`) leak into overrides whose
`_target_` does not accept them — breaking the AdaGrad and SGD
grafting cases. Replace each override with a delete (`~`) followed by
add (`+`) so the new mapping fully replaces the inherited one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@meta-cla bot added the CLA Signed label Apr 25, 2026
@runame mentioned this pull request Apr 25, 2026
The `4-core-ubuntu-gpu-t4` runner currently has an outdated NVIDIA
driver, so `torch.cuda.is_available()` returns False and the
"Run DDP example on GPU." step crashes with
`ProcessGroupNCCL is only supported with GPUs, no GPUs found!`.
The "Run single GPU example on GPU." step "passes" only because
torch silently falls back to CPU.

Add a `gpu_check` step that probes `torch.cuda.is_available()` and
gate both GPU-only steps on its output. If no GPU is detected, those
steps are skipped (with a workflow warning) and the job stays green.
When the runner image is fixed and a GPU is actually available, both
steps run as before with no other changes needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
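
A minimal sketch of the probe such a `gpu_check` step could run (the step id
and the `has_gpu` output name are assumptions, not quoted from the workflow):

```python
# Probe CUDA availability and expose the result as a step output, so later
# steps can be gated with: if: steps.gpu_check.outputs.has_gpu == 'true'
import os

import torch

has_gpu = torch.cuda.is_available()
with open(os.environ["GITHUB_OUTPUT"], "a") as fh:
    fh.write(f"has_gpu={str(has_gpu).lower()}\n")
if not has_gpu:
    # GitHub Actions renders this as a workflow warning annotation.
    print("::warning::No usable GPU on this runner; skipping GPU-only steps.")
```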
@meta-codesync bot commented Apr 29, 2026

@hjmshi has imported this pull request. If you are a Meta employee, you can view this in D103081868.

@wz337 left a comment

Review automatically exported from Phabricator review in Meta.

@meta-codesync bot closed this in b8aea3f May 5, 2026
meta-codesync Bot pushed a commit that referenced this pull request May 5, 2026
Summary:
Resolves the remaining `pre-commit` and `type-check` job failures on `main` after PRs #256 (init imports) and #257 (examples workflow). Three commits, each separately reviewable:

### Commit 1: lint fixes (ruff E402 + E731) and copyright header normalization

- **E402 (~92 violations across 12 files)**: every affected file had a duplicate triple-quoted copyright block at the top followed by the *real* module docstring (or, in `gpa/`, a `#`-comment copyright). Python only treats the first triple-quoted string as a docstring, so the duplicate counted as a non-import statement and triggered E402 for every subsequent import. Drop the duplicate so the real module docstring is the first statement.
- **E731 (2 violations)** in `_shampoo_fully_shard_lossless_distributor.py:69` and `_shampoo_hybrid_shard_lossless_distributor.py:75`: rewrite `should_assign_param_idx = lambda i: ...` as a `def` (see the sketch after this list).
- **Copyright convention**: 79/79 untouched files use a single triple-quoted block. The 6 gpa files I touched (plus `gpa/gpa_adamw.py`, which had both styles) are normalized to that convention by dropping their `#`-comment copyright in favor of the triple-quoted form.
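
The E731 rewrite from the second bullet, shown schematically (the predicate
body here is invented; the real logic lives in the two distributor files):

```python
# Before: ruff E731 flags a lambda assigned to a name.
should_assign_param_idx = lambda i: i % 2 == 0

# After: an equivalent def, which ruff accepts and which carries a real
# name in tracebacks.
def should_assign_param_idx(i: int) -> bool:
    return i % 2 == 0
```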

### Commit 2: ruff-format pass

The pinned `ruff-format` (v0.8.0 in `.pre-commit-config.yaml`) considers 14 unrelated files mis-formatted (purely whitespace around `assert ..., (...)` wrapping). Apply the formatter so the hook passes on the first run instead of failing after auto-modifying files.

### Commit 3: mypy fixes (33 errors → 0)

All 27 errors reported by CI plus 6 local-only errors that newer mypy/torch stubs surface:

- **gpa/gpa_adamw.py** (10 errors):
  - `step()`: add an `overload` pair to match `torch.optim.Optimizer` (sketched after this list)
  - `step()`: annotate per-group buffer lists as `list[Tensor]`
  - `step()`: skip empty parameter groups via `continue`; rename group-local `first_param` to `group_first_param` to avoid mypy union with the `Optional[Parameter]` from the train-mode check loop
- **gpa/tests/gpa_test_utils.py:42**: annotate `devices` as `tuple[device, ...]`
- **gpa/gpu_tests/gpa_adamw_numerics_test.py:31**: ignore missing `parameterized` stubs (no public stub package)
- **gpa/examples/cifar10_example.py:115**: ignore `len()` on `Dataset[Any]`
- **distributed_shampoo/distributed_shampoo.py**:
  - lines 669-671: drop redundant re-annotations on the `AdaGradPreconditionerConfig` branch (no-redef)
  - line 1632: move stray `# type: ignore` onto the `_pre_load_train_modes` assignment line so it actually applies
- **distributed_shampoo/distributor/_shampoo_hybrid_shard_lossless_distributor.py:160**: replace `filter(lambda, ...)` with a generator expression to dodge a confusing `TypeGuard` overload mypy picks for `filter()`
- **distributed_shampoo/distributor/shampoo_{ddp,hsdp,hybrid_shard}_distributor.py**: ignore `attr-defined` for the private `DeviceMesh._get_all_submeshes` — local-only, surfaces with current torch stubs
- **distributed_shampoo/distributor/shampoo_distributor.py:228**: ignore narrowing of `map(partial(tuple), ...)` assignment — local-only
- **distributed_shampoo/distributor/gpu_tests/shampoo_checkpoint_test.py:376,411** and **examples/parallelism.py:177,206**: ignore `arg-type` for `partial[FSDPModule]` / `FSDPModule` passed to a `Module` parameter (FSDPModule mixes in but mypy can't see it)
- **distributed_shampoo/examples/utils.py:161**: annotate `sampler`
- **distributed_shampoo/examples/tests/utils_test.py:228,240**: replace `assertIsNotNone` with `assert ... is not None` so mypy narrows `_fmt` before `assertIn`
- **distributed_shampoo/utils/load_balancing_utils.py:57**: extend existing `# type: ignore[misc]` to also cover `call-overload` from `max(float, floating[Any])`
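
The `overload` pair from the `step()` bullet, sketched against torch's stubs
for `Optimizer.step` (the class name and method body are illustrative
stand-ins for the code in `gpa_adamw.py`):

```python
from typing import Callable, Optional, overload

import torch


class GPAAdamW(torch.optim.Optimizer):
    # The two overloads mirror torch.optim.Optimizer.step: calling with no
    # closure returns None; calling with a closure returns its float loss.
    @overload
    def step(self, closure: None = ...) -> None: ...

    @overload
    def step(self, closure: Callable[[], float]) -> float: ...

    def step(self, closure: Optional[Callable[[], float]] = None) -> Optional[float]:
        loss = None if closure is None else closure()
        # ... parameter updates elided ...
        return loss
```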

## Local verification

```
$ ruff check .
All checks passed!
$ ruff format --check .
93 files already formatted
$ mypy .
Success: no issues found in 93 source files
```

Pull Request resolved: #258

Test Plan:
- [x] CI `pre-commit` job passes
- [x] CI `type-check` job passes
- [x] Other CI jobs (`tests`, `gpu-tests`, `examples`) unchanged

## Notes

Stacked on top of #256 (`fix/init-stale-rename-imports`) and #257 (`fix/examples-beta2-config`). Until those merge, the diff for this PR will appear to include their commits. Merge order: #256 → #257 → this.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed By: wz337

Differential Revision: D103082000

Pulled By: hjmshi

fbshipit-source-id: d4bf91517bc7426019a3fde42420e4408b2bbc59
@meta-codesync bot commented May 5, 2026

@hjmshi merged this pull request in b8aea3f.
