Fix examples workflow grafting config overrides #257
Closed
runame wants to merge 3 commits into facebookresearch:main from fix/examples-beta2-config
Conversation
`AmortizedPreconditionerConfig` was renamed to `BaseShampooPreconditionerConfig` and `ShampooPreconditionerConfig` to `ClassicShampooPreconditionerConfig` in `shampoo_types.py`, but `__init__.py` still imported and re-exported the old names. Replace them with the new names in the imports and `__all__`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
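A minimal sketch of what that rename looks like in `__init__.py` (the surrounding imports and the rest of `__all__` are omitted; this is illustrative, not the file's full contents):

```python
# Sketch only: the real __init__.py imports and re-exports many more names.
from .shampoo_types import (
    BaseShampooPreconditionerConfig,     # previously AmortizedPreconditionerConfig
    ClassicShampooPreconditionerConfig,  # previously ShampooPreconditionerConfig
)

__all__ = [
    "BaseShampooPreconditionerConfig",
    "ClassicShampooPreconditionerConfig",
    # ... other exports unchanged
]
```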
The CLI overrides `optimizer.grafting_config={...}` use OmegaConf's
default merge semantics, so fields from the base shampoo.yaml
grafting_config (notably `beta2: 0.999`) leak into overrides whose
`_target_` does not accept them — breaking the AdaGrad and SGD
grafting cases. Replace each override with a delete (`~`) followed by
add (`+`) so the new mapping fully replaces the inherited one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `4-core-ubuntu-gpu-t4` runner currently has an outdated NVIDIA driver, so `torch.cuda.is_available()` returns False and the "Run DDP example on GPU." step crashes with `ProcessGroupNCCL is only supported with GPUs, no GPUs found!`. The "Run single GPU example on GPU." step "passes" only because torch silently falls back to CPU. Add a `gpu_check` step that probes `torch.cuda.is_available()` and gate both GPU-only steps on its output. If no GPU is detected, those steps are skipped (with a workflow warning) and the job stays green. When the runner image is fixed and a GPU is actually available, both steps run as before with no other changes needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
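A hypothetical sketch of the probe such a `gpu_check` step could run (the output name, warning text, and step wiring are illustrative, not the workflow's exact contents):

```python
# Illustrative gpu_check probe; assumes it runs inside a GitHub Actions step,
# where GITHUB_OUTPUT points at the step-output file.
import os

import torch

has_gpu = torch.cuda.is_available()

# Expose the result as a step output so later steps can be gated with
# `if: steps.gpu_check.outputs.has_gpu == 'true'`.
with open(os.environ["GITHUB_OUTPUT"], "a") as fh:
    fh.write(f"has_gpu={str(has_gpu).lower()}\n")

if not has_gpu:
    # A `::warning::` command surfaces an annotation in the job summary, so the
    # skipped GPU-only steps are visible rather than silently green.
    print("::warning::No usable GPU detected; GPU-only example steps will be skipped.")
```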
@hjmshi has imported this pull request. If you are a Meta employee, you can view this in D103081868.
wz337 approved these changes on Apr 30, 2026
wz337 (Contributor) left a comment:
Review automatically exported from Phabricator review in Meta.
meta-codesync Bot pushed a commit that referenced this pull request on May 5, 2026
Summary: Resolves the remaining `pre-commit` and `type-check` job failures on `main` after PRs #256 (init imports) and #257 (examples workflow). Three commits, each separately reviewable:

### Commit 1: lint fixes (ruff E402 + E731) and copyright header normalization

- **E402 (~92 violations across 12 files)**: every affected file had a duplicate triple-quoted copyright block at the top followed by the *real* module docstring (or, in `gpa/`, a `#`-comment copyright). Python only treats the first triple-quoted string as a docstring, so the duplicate counted as a non-import statement and triggered E402 for every subsequent import. Drop the duplicate so the real module docstring is the first statement.
- **E731 (2 violations)** in `_shampoo_fully_shard_lossless_distributor.py:69` and `_shampoo_hybrid_shard_lossless_distributor.py:75`: rewrite `should_assign_param_idx = lambda i: ...` as a `def`.
- **Copyright convention**: 79/79 untouched files use a single triple-quoted block. The 6 gpa files I touched (plus `gpa/gpa_adamw.py`, which had both styles) are normalized to that convention by dropping their `#`-comment copyright in favor of the triple-quoted form.

### Commit 2: ruff-format pass

The pinned `ruff-format` (v0.8.0 in `.pre-commit-config.yaml`) considers 14 unrelated files mis-formatted (purely whitespace around `assert ..., (...)` wrapping). Applying the formatter so the hook passes on the first run instead of failing after auto-modifying files.

### Commit 3: mypy fixes (33 errors → 0)

All 27 errors reported by CI plus 6 local-only errors that newer mypy/torch stubs surface:

- **gpa/gpa_adamw.py** (10 errors):
  - `step()`: add `overload` pair to match `torch.optim.Optimizer`
  - `step()`: annotate per-group buffer lists as `list[Tensor]`
  - `step()`: skip empty parameter groups via `continue`; rename group-local `first_param` to `group_first_param` to avoid mypy union with the `Optional[Parameter]` from the train-mode check loop
- **gpa/tests/gpa_test_utils.py:42**: annotate `devices` as `tuple[device, ...]`
- **gpa/gpu_tests/gpa_adamw_numerics_test.py:31**: ignore missing `parameterized` stubs (no public stub package)
- **gpa/examples/cifar10_example.py:115**: ignore `len()` on `Dataset[Any]`
- **distributed_shampoo/distributed_shampoo.py**:
  - lines 669-671: drop redundant re-annotations on the `AdaGradPreconditionerConfig` branch (no-redef)
  - line 1632: move stray `# type: ignore` onto the `_pre_load_train_modes` assignment line so it actually applies
- **distributed_shampoo/distributor/_shampoo_hybrid_shard_lossless_distributor.py:160**: replace `filter(lambda, ...)` with a generator expression to dodge a confusing `TypeGuard` overload mypy picks for `filter()`
- **distributed_shampoo/distributor/shampoo_{ddp,hsdp,hybrid_shard}_distributor.py**: ignore `attr-defined` for the private `DeviceMesh._get_all_submeshes` — local-only, surfaces with current torch stubs
- **distributed_shampoo/distributor/shampoo_distributor.py:228**: ignore narrowing of `map(partial(tuple), ...)` assignment — local-only
- **distributed_shampoo/distributor/gpu_tests/shampoo_checkpoint_test.py:376,411** and **examples/parallelism.py:177,206**: ignore `arg-type` for `partial[FSDPModule]` / `FSDPModule` passed to a `Module` parameter (FSDPModule mixes in but mypy can't see it)
- **distributed_shampoo/examples/utils.py:161**: annotate `sampler`
- **distributed_shampoo/examples/tests/utils_test.py:228,240**: replace `assertIsNotNone` with `assert ... is not None` so mypy narrows `_fmt` before `assertIn`
- **distributed_shampoo/utils/load_balancing_utils.py:57**: extend existing `# type: ignore[misc]` to also cover `call-overload` from `max(float, floating[Any])`

## Local verification

```
$ ruff check .
All checks passed!
$ ruff format --check .
93 files already formatted
$ mypy .
Success: no issues found in 93 source files
```

Pull Request resolved: #258

Test Plan:
- [x] CI `pre-commit` job passes
- [x] CI `type-check` job passes
- [x] Other CI jobs (`tests`, `gpu-tests`, `examples`) unchanged

## Notes

Stacked on top of #256 (`fix/init-stale-rename-imports`) and #257 (`fix/examples-beta2-config`). Until those merge, the diff for this PR will appear to include their commits. Merge order: #256 → #257 → this.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed By: wz337

Differential Revision: D103082000

Pulled By: hjmshi

fbshipit-source-id: d4bf91517bc7426019a3fde42420e4408b2bbc59
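To make the E731 item above concrete, a sketch of the lambda-to-`def` rewrite (the predicate body and surrounding names are invented for illustration, not copied from the distributor files):

```python
world_size, rank = 4, 1  # stand-in values for the distributed context

# Before: binding a lambda to a name trips ruff E731.
should_assign_param_idx = lambda i: i % world_size == rank  # noqa: E731

# After: an equivalent def satisfies E731 and carries a proper __name__.
def should_assign_param_idx(i: int) -> bool:
    return i % world_size == rank

assert should_assign_param_idx(5)
```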
Summary
The four grafting CLI overrides in `.github/workflows/examples.yaml` use the form `'optimizer.grafting_config={_target_: ..., ...}'`. With OmegaConf's default merge semantics, this merges the new mapping into the inherited `grafting_config` from `distributed_shampoo/examples/configs/optimizer/shampoo.yaml`, which currently defines (at least) `beta2: 0.999` and `epsilon: 1e-08`.

So an override with `_target_: AdaGradPreconditionerConfig` ends up as `{_target_: AdaGradPreconditionerConfig, beta2: 0.999, epsilon: 1e-08}`, and Hydra instantiation fails because that target does not accept `beta2`.
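This merge behavior can be reproduced with plain OmegaConf; in the sketch below the base `_target_` value is a guess, while `beta2` and `epsilon` match the fields that actually leak:

```python
from omegaconf import OmegaConf

# Stand-in for shampoo.yaml's grafting_config (the _target_ value here is assumed).
base = OmegaConf.create(
    {"grafting_config": {"_target_": "RMSpropPreconditionerConfig", "beta2": 0.999, "epsilon": 1e-08}}
)
# Stand-in for the CLI override 'optimizer.grafting_config={_target_: AdaGradPreconditionerConfig, ...}'.
override = OmegaConf.create(
    {"grafting_config": {"_target_": "AdaGradPreconditionerConfig", "epsilon": 1e-08}}
)

merged = OmegaConf.merge(base, override)
print(OmegaConf.to_container(merged))
# {'grafting_config': {'_target_': 'AdaGradPreconditionerConfig', 'beta2': 0.999, 'epsilon': 1e-08}}
# beta2 survives the merge even though the AdaGrad target does not accept it.
```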
`SGDPreconditionerConfig` has the same latent bug (it would inherit both `beta2` and `epsilon`), but execution stops at the AdaGrad line. Adam and RMSprop happen to work because their `_target_` accepts `beta2`.

This regression was introduced in commit 405622d ("Consolidate CIFAR-10 examples in Shampoo with hydra configs"), which switched the workflow from argparse-based scripts to Hydra overrides without accounting for dict-merge semantics.
Fix
Replace each `optimizer.grafting_config={...}` override with the standard Hydra delete-then-add idiom:
```
'~optimizer.grafting_config' '+optimizer.grafting_config={...}'
```
This forces the inherited mapping to be discarded and the new mapping
inserted whole, so only the explicitly-listed keys are passed to the
target's constructor.
Verified locally with `hydra.compose`:
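A minimal sketch of such a check (not necessarily the exact invocation used), assuming the example configs live under `distributed_shampoo/examples/configs` with a primary config named `config`; the `_target_` path and `epsilon` value are illustrative:

```python
from hydra import compose, initialize

# Sketch only: confirms that delete-then-add leaves exactly the override's keys.
with initialize(version_base=None, config_path="distributed_shampoo/examples/configs"):
    cfg = compose(
        config_name="config",
        overrides=[
            "~optimizer.grafting_config",
            "+optimizer.grafting_config={_target_: distributed_shampoo.AdaGradPreconditionerConfig, epsilon: 1e-08}",
        ],
    )
    # With a plain '=' override the inherited beta2 would still show up here.
    print(cfg.optimizer.grafting_config)
```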
Applied to all six grafting-override invocations in the file (CPU
single-GPU loop, GPU single-GPU, DDP CPU, DDP GPU) for consistency.
Test plan
Notes
Stacked on top of #256 (`fix/init-stale-rename-imports`). That PR's
import fix unblocked the `examples` job enough to surface this latent
merge bug. The branch is rebased on `fix/init-stale-rename-imports`,
so until #256 merges this diff will appear to include #256's commit.
Merge order: #256 first, then this.
🤖 Generated with Claude Code