Draft: first draft of an AOT autotuning runner and cache and heuristics gener… #1278
base: main
Conversation
```
@@ -0,0 +1,105 @@
#!/usr/bin/env python3
```
Thinking about Horace's example, and curious: would it make sense to support a "benchmark only" mode in the collect phase (or a separate phase) that skips autotuning and just measures existing configs against additional shapes (similar to secondary_inputs in Horace's RFC)? This would let users:
- Run `collect` on a small set of representative shapes (to do full autotune on)
- Run benchmark-only on a larger set of shapes (just measure)
- Build heuristics using the full timing matrix
The script already lets you specify different benchmarks for all three phases of measurement, so you can collect on a different benchmark than the one you measure.
Is that what you're asking about, or something else?
Ah nice! Curious: should we add an example showing this workflow? cc @Chillee: would this cover the original need of primary_inputs / secondary_inputs?
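For readers following along, here is a minimal sketch of the "tune few, measure many" split discussed in this thread. All names (`build_timing_matrix`, `autotune`, `benchmark`) are illustrative; the runner's real datastore and APIs differ.

```python
def build_timing_matrix(tune_shapes, extra_shapes, autotune, benchmark):
    """Sketch only: shapes are assumed to be hashable (e.g. tuples)."""
    # Collect phase: full autotune only on the representative shapes.
    candidates = [autotune(shape) for shape in tune_shapes]
    # Measure (benchmark-only) phase: time every collected config across
    # the combined shape set, with no further autotuning.
    return {
        (shape, config): benchmark(shape, config)
        for shape in [*tune_shapes, *extra_shapes]
        for config in candidates
    }
    # The heuristic-generation phase would then fit a model on this matrix.
```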
```
The AOT workflow consists of three phases:
1. Collect: Run benchmarks, autotuning each shape individually
2. Measure: Re-run benchmarks, measuring all configs across all shapes
```
(As we discussed, maybe "all shapes" is not exactly accurate, since the user can customize which shapes to run in each phase.)
| """Represents a unique shape/dtype combination for a kernel.""" | ||
|
|
||
| kernel_name: str | ||
| specialization_key: tuple[Any, ...] |
So ShapeKey may correspond to a partially specialized shape? If so:
- Can ShapeKeys collide, i.e. can an input shape be matched against multiple ShapeKeys?
- How much confidence can we have that a config that's best for a ShapeKey actually works well for an input shape that's wildly different at runtime? Take your tall-skinny / short-wide tensor as an example: if the first dimension changes by 100x while the second dimension stays the same, it would jump from tall-skinny to short-wide or vice versa.
The datastore stores all shape information (all shapes, strides, and data types) in addition to this.
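For reference, a minimal sketch of how the excerpted fields might fit together as a dataclass; only `kernel_name` and `specialization_key` appear in the diff above, so the decorator, import, and comments are assumptions:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)  # assumed; the PR excerpt only shows the two fields
class ShapeKey:
    """Represents a unique shape/dtype combination for a kernel."""

    kernel_name: str
    # Hashable summary of the specialized properties (e.g. shapes, strides,
    # dtypes); the exact contents are not shown in the excerpt.
    specialization_key: tuple[Any, ...]
```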
```
return model, accuracy, feature_names


def generate_heuristic_code(
```
Is the generated heuristic-based selection code ultimately what users are supposed to call in deployment?
This may help tackle one of the challenges in Helion + vLLM, namely selecting the best config based on batch_size, which varies per token. Note that it would have a fairly high bar for latency, since it is triggered per token, so invoking a chunk of (complex?) Python logic might be too slow.
Yes. This means it is interpretable and can be version-controlled. Depending on how complex it gets, we could compose it with the existing caching logic to cache the heuristic result.
@gmagogsfm this is what an example decision tree output might look like: https://gist.github.com/v0i0/d6604662d7095a040ce0db049e192c14
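To make the two comments above concrete, here is a hypothetical sketch of a generated decision-tree heuristic with a cache composed around it. The function names, the `CONFIGS` table, and the branch thresholds are invented for illustration, not taken from the linked gist:

```python
import functools

# Placeholder table of candidate configs assumed to be emitted alongside
# the generated heuristic.
CONFIGS = ["config_0", "config_1", "config_2"]

def _decide(m: int, n: int) -> int:
    # Learned decision tree (illustrative thresholds); each branch maps a
    # shape region to an index into CONFIGS.
    if m <= 256:
        return 0 if n <= 1024 else 1
    return 2

@functools.lru_cache(maxsize=None)
def select_config(m: int, n: int):
    # Caching the heuristic result per shape reduces the per-token cost to
    # a dict lookup after the first call, addressing the latency concern.
    return CONFIGS[_decide(m, n)]
```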
Force-pushed from 3f0ced4 to 15f568d
proposal for #1161
WIP, needs more testing & features
Basic idea: you run `python -m helion.autotuner.aot_runner --benchmark "python your_benchmark.py"` and you get files called `_{filename}_{arch}.py` containing the heuristic that maps shapes to configs.
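As a hypothetical illustration of how such a generated file might be consumed (the PR does not specify the generated file's interface, so the module and function names below are invented):

```python
# Invented example: assumes the generated _{filename}_{arch}.py exposes a
# shape -> config selection function; the real interface may differ.
from _my_kernel_sm90 import select_config  # file produced by aot_runner

config = select_config(4096, 512)  # pick the pre-tuned config for this shape
```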