Fix examples workflow grafting config overrides#257

Closed
runame wants to merge 3 commits into facebookresearch:main from runame:fix/examples-beta2-config

Conversation

@runame commented Apr 25, 2026

Summary

The four grafting CLI overrides in .github/workflows/examples.yaml use the form
'optimizer.grafting_config={_target_:...,...}'. With OmegaConf's default merge
semantics, this merges the new mapping into the inherited
grafting_config from distributed_shampoo/examples/configs/optimizer/shampoo.yaml,
which currently defines:

```
grafting_config:
  _target_: distributed_shampoo.AdamPreconditionerConfig
  beta2: 0.999
  epsilon: 1e-8
```

So an override with `_target_: AdaGradPreconditionerConfig` ends up as
`{_target_: AdaGradPreconditionerConfig, beta2: 0.999, epsilon: 1e-08}`, and
Hydra instantiation fails with:

TypeError: AdaGradPreconditionerConfig.__init__() got an unexpected keyword argument 'beta2'

SGDPreconditionerConfig has the same latent bug (it would inherit both
beta2 and epsilon), but execution stops at the AdaGrad line. Adam
and RMSprop happen to work because their _target_ accepts beta2.
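
The merge behavior can be reproduced with OmegaConf alone. A minimal sketch
(class names abbreviated to bare strings; not code from the repo):

```python
# Reproduce the default merge semantics described above: keys of the base
# mapping that the override does not mention survive the merge.
from omegaconf import OmegaConf

base = OmegaConf.create(
    {"_target_": "AdamPreconditionerConfig", "beta2": 0.999, "epsilon": 1e-8}
)
override = OmegaConf.create({"_target_": "AdaGradPreconditionerConfig"})

merged = OmegaConf.merge(base, override)
print(merged)
# {'_target_': 'AdaGradPreconditionerConfig', 'beta2': 0.999, 'epsilon': 1e-08}
```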

This regression was introduced in commit 405622d ("Consolidate CIFAR-10
examples in Shampoo with hydra configs"), which switched the workflow
from argparse-based scripts to Hydra overrides without accounting for
dict-merge semantics.

Fix

Replace each optimizer.grafting_config={...} override with the
standard Hydra delete-then-add idiom:

```
'~optimizer.grafting_config' '+optimizer.grafting_config={...}'
```

This forces the inherited mapping to be discarded and the new mapping
inserted whole, so only the explicitly-listed keys are passed to the
target's constructor.

Verified locally with `hydra.compose` (sketched below):

- old override → `{_target_: AdaGrad, beta2: 0.999, epsilon: 1e-08}` (broken)
- new override → `{_target_: AdaGrad, epsilon: 1e-08}` (correct)
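
A sketch of that check, assuming a local `configs` directory and a top-level
`config.yaml` like the ones under `distributed_shampoo/examples/configs`
(the path and config name are illustrative):

```python
# Compose the config with the delete-then-add override and inspect the
# resulting grafting_config node.
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="config",
        overrides=[
            "~optimizer.grafting_config",
            "+optimizer.grafting_config={_target_: distributed_shampoo.AdaGradPreconditionerConfig, epsilon: 1e-8}",
        ],
    )

print(OmegaConf.to_yaml(cfg.optimizer.grafting_config))
# _target_: distributed_shampoo.AdaGradPreconditionerConfig
# epsilon: 1e-08
```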

The fix is applied to all six grafting-override invocations in the file (CPU
single-GPU loop, GPU single-GPU, DDP CPU, DDP GPU) for consistency.

Test plan

- CI examples job passes on this branch
- AdaGrad, Adam, RMSprop, and SGD grafting cases all run to completion

Notes

Stacked on top of #256 (`fix/init-stale-rename-imports`). That PR's
import fix unblocked the `examples` job enough to surface this latent
merge bug. The branch is rebased on `fix/init-stale-rename-imports`,
so until #256 merges this diff will appear to include #256's commit.
Merge order: #256 first, then this.

🤖 Generated with Claude Code

runame and others added 2 commits April 11, 2026 20:38
`AmortizedPreconditionerConfig` was renamed to
`BaseShampooPreconditionerConfig` and `ShampooPreconditionerConfig` to
`ClassicShampooPreconditionerConfig` in `shampoo_types.py`, but
`__init__.py` still imported and re-exported the old names. Replace
them with the new names in the imports and `__all__`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CLI overrides `optimizer.grafting_config={...}` use OmegaConf's
default merge semantics, so fields from the base shampoo.yaml
grafting_config (notably `beta2: 0.999`) leak into overrides whose
`_target_` does not accept them — breaking the AdaGrad and SGD
grafting cases. Replace each override with a delete (`~`) followed by
add (`+`) so the new mapping fully replaces the inherited one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@meta-cla bot added the CLA Signed label Apr 25, 2026
@runame mentioned this pull request Apr 25, 2026
The `4-core-ubuntu-gpu-t4` runner currently has an outdated NVIDIA
driver, so `torch.cuda.is_available()` returns False and the
"Run DDP example on GPU." step crashes with
`ProcessGroupNCCL is only supported with GPUs, no GPUs found!`.
The "Run single GPU example on GPU." step "passes" only because
torch silently falls back to CPU.

Add a `gpu_check` step that probes `torch.cuda.is_available()` and
gate both GPU-only steps on its output. If no GPU is detected, those
steps are skipped (with a workflow warning) and the job stays green.
When the runner image is fixed and a GPU is actually available, both
steps run as before with no other changes needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
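
A minimal sketch of the probe such a `gpu_check` step could run (the step id
and the `has_gpu` output name are assumptions, not quoted from the workflow):

```python
# Probe CUDA availability and expose the result as a step output, so later
# steps can be gated with: if: steps.gpu_check.outputs.has_gpu == 'true'
import os

import torch

has_gpu = torch.cuda.is_available()
with open(os.environ["GITHUB_OUTPUT"], "a") as fh:
    fh.write(f"has_gpu={str(has_gpu).lower()}\n")
if not has_gpu:
    # GitHub Actions renders this as a workflow warning annotation.
    print("::warning::No usable GPU on this runner; skipping GPU-only steps.")
```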
@meta-codesync bot commented Apr 29, 2026

@hjmshi has imported this pull request. If you are a Meta employee, you can view this in D103081868.

@wz337 left a comment

Review automatically exported from Phabricator review in Meta.

@meta-codesync bot closed this in b8aea3f May 5, 2026
meta-codesync Bot pushed a commit that referenced this pull request May 5, 2026
Summary:
Resolves the remaining `pre-commit` and `type-check` job failures on `main` after PRs #256 (init imports) and #257 (examples workflow). Three commits, each separately reviewable:

### Commit 1: lint fixes (ruff E402 + E731) and copyright header normalization

- **E402 (~92 violations across 12 files)**: every affected file had a duplicate triple-quoted copyright block at the top followed by the *real* module docstring (or, in `gpa/`, a `#`-comment copyright). Python only treats the first triple-quoted string as a docstring, so the duplicate counted as a non-import statement and triggered E402 for every subsequent import. Drop the duplicate so the real module docstring is the first statement.
- **E731 (2 violations)** in `_shampoo_fully_shard_lossless_distributor.py:69` and `_shampoo_hybrid_shard_lossless_distributor.py:75`: rewrite `should_assign_param_idx = lambda i: ...` as a `def` (see the sketch after this list).
- **Copyright convention**: 79/79 untouched files use a single triple-quoted block. The 6 gpa files I touched (plus `gpa/gpa_adamw.py`, which had both styles) are normalized to that convention by dropping their `#`-comment copyright in favor of the triple-quoted form.
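
The E731 rewrite from the second bullet, shown schematically (the predicate
body here is invented; the real logic lives in the two distributor files):

```python
# Before: ruff E731 flags a lambda assigned to a name.
should_assign_param_idx = lambda i: i % 2 == 0

# After: an equivalent def, which ruff accepts and which carries a real
# name in tracebacks.
def should_assign_param_idx(i: int) -> bool:
    return i % 2 == 0
```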

### Commit 2: ruff-format pass

The pinned `ruff-format` (v0.8.0 in `.pre-commit-config.yaml`) considers 14 unrelated files mis-formatted (purely whitespace around `assert ..., (...)` wrapping). Apply the formatter so the hook passes on the first run instead of failing after auto-modifying files.

### Commit 3: mypy fixes (33 errors → 0)

All 27 errors reported by CI plus 6 local-only errors that newer mypy/torch stubs surface:

- **gpa/gpa_adamw.py** (10 errors):
  - `step()`: add an `overload` pair to match `torch.optim.Optimizer` (sketched after this list)
  - `step()`: annotate per-group buffer lists as `list[Tensor]`
  - `step()`: skip empty parameter groups via `continue`; rename group-local `first_param` to `group_first_param` to avoid mypy union with the `Optional[Parameter]` from the train-mode check loop
- **gpa/tests/gpa_test_utils.py:42**: annotate `devices` as `tuple[device, ...]`
- **gpa/gpu_tests/gpa_adamw_numerics_test.py:31**: ignore missing `parameterized` stubs (no public stub package)
- **gpa/examples/cifar10_example.py:115**: ignore `len()` on `Dataset[Any]`
- **distributed_shampoo/distributed_shampoo.py**:
  - lines 669-671: drop redundant re-annotations on the `AdaGradPreconditionerConfig` branch (no-redef)
  - line 1632: move stray `# type: ignore` onto the `_pre_load_train_modes` assignment line so it actually applies
- **distributed_shampoo/distributor/_shampoo_hybrid_shard_lossless_distributor.py:160**: replace `filter(lambda, ...)` with a generator expression to dodge a confusing `TypeGuard` overload mypy picks for `filter()`
- **distributed_shampoo/distributor/shampoo_{ddp,hsdp,hybrid_shard}_distributor.py**: ignore `attr-defined` for the private `DeviceMesh._get_all_submeshes` — local-only, surfaces with current torch stubs
- **distributed_shampoo/distributor/shampoo_distributor.py:228**: ignore narrowing of `map(partial(tuple), ...)` assignment — local-only
- **distributed_shampoo/distributor/gpu_tests/shampoo_checkpoint_test.py:376,411** and **examples/parallelism.py:177,206**: ignore `arg-type` for `partial[FSDPModule]` / `FSDPModule` passed to a `Module` parameter (FSDPModule mixes in but mypy can't see it)
- **distributed_shampoo/examples/utils.py:161**: annotate `sampler`
- **distributed_shampoo/examples/tests/utils_test.py:228,240**: replace `assertIsNotNone` with `assert ... is not None` so mypy narrows `_fmt` before `assertIn`
- **distributed_shampoo/utils/load_balancing_utils.py:57**: extend existing `# type: ignore[misc]` to also cover `call-overload` from `max(float, floating[Any])`
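
The `overload` pair from the `step()` bullet, sketched against torch's stubs
for `Optimizer.step` (the class name and method body are illustrative
stand-ins for the code in `gpa_adamw.py`):

```python
from typing import Callable, Optional, overload

import torch


class GPAAdamW(torch.optim.Optimizer):
    # The two overloads mirror torch.optim.Optimizer.step: calling with no
    # closure returns None; calling with a closure returns its float loss.
    @overload
    def step(self, closure: None = ...) -> None: ...

    @overload
    def step(self, closure: Callable[[], float]) -> float: ...

    def step(self, closure: Optional[Callable[[], float]] = None) -> Optional[float]:
        loss = None if closure is None else closure()
        # ... parameter updates elided ...
        return loss
```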

## Local verification

```
$ ruff check .
All checks passed!
$ ruff format --check .
93 files already formatted
$ mypy .
Success: no issues found in 93 source files
```

Pull Request resolved: #258

Test Plan:
- [x] CI `pre-commit` job passes
- [x] CI `type-check` job passes
- [x] Other CI jobs (`tests`, `gpu-tests`, `examples`) unchanged

## Notes

Stacked on top of #256 (`fix/init-stale-rename-imports`) and #257 (`fix/examples-beta2-config`). Until those merge, the diff for this PR will appear to include their commits. Merge order: #256 → #257 → this.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed By: wz337

Differential Revision: D103082000

Pulled By: hjmshi

fbshipit-source-id: d4bf91517bc7426019a3fde42420e4408b2bbc59
@meta-codesync bot commented May 5, 2026

@hjmshi merged this pull request in b8aea3f.
