Conversation

Contributor

@v0i0 v0i0 commented Dec 17, 2025


Proposal for #1161.

WIP; needs more testing and features.

Basic idea: you run `python -m helion.autotuner.aot_runner --benchmark "python your_benchmark.py"` and get files named `_{filename}_{arch}.py` containing the heuristic that maps shapes to configs.
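For illustration, a generated heuristic file might look roughly like this minimal sketch. The function name, shape parameters, and `Config` fields here are assumptions for illustration, not the runner's actual output; see the gist linked further down for a real example.

```python
# Hypothetical sketch of a generated _{filename}_{arch}.py; the function
# name and Config fields are illustrative, not the actual emitted code.
import helion


def select_config(m: int, n: int) -> helion.Config:
    # Decision tree over input shapes, derived from measured timings.
    if m <= 64:
        return helion.Config(block_sizes=[32, 64], num_warps=4)
    if n >= 4096:
        return helion.Config(block_sizes=[128, 128], num_warps=8)
    return helion.Config(block_sizes=[64, 64], num_warps=4)
```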

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 17, 2025
@v0i0 v0i0 changed the title first draft of a aot autotuning runner and cache and heuristics gener… Draft: first draft of a aot autotuning runner and cache and heuristics gener… Dec 17, 2025
@v0i0 v0i0 requested review from Chillee, jansel and yf225 December 17, 2025 01:43
@@ -0,0 +1,105 @@
#!/usr/bin/env python3
Contributor

@yf225 yf225 Dec 17, 2025

Thinking about Horace's example, and curious: would it make sense to support a "benchmark-only" mode in the collect phase (or as a separate phase) that skips autotuning and just measures existing configs against additional shapes (similar to secondary_inputs in Horace's RFC)? This would let users:

  1. Run collect on a small set of representative shapes (to do a full autotune on)
  2. Run benchmark-only on a larger set of shapes (just measure)
  3. Build heuristics using the full timing matrix (rough sketch after this list)
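
Not part of the PR, just to make the idea concrete: a minimal sketch of the timing-matrix step, with made-up shapes, config names, and a stand-in timing function.

```python
import random


def time_config(shape: tuple[int, ...], config: str) -> float:
    """Stand-in for a real kernel benchmark run."""
    return random.random()


representative = [(128, 128), (4096, 64)]        # phase 1: full autotune set
extra = [(256, 512), (64, 4096), (2048, 2048)]   # phase 2: benchmark-only set
config_pool = ["cfg_a", "cfg_b"]                 # configs found during collect

# Phase 3 input: the full timing matrix across both shape sets.
timings = {
    shape: {cfg: time_config(shape, cfg) for cfg in config_pool}
    for shape in representative + extra
}
best = {shape: min(cfgs, key=cfgs.get) for shape, cfgs in timings.items()}
```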

Contributor Author

The script already lets you specify a different benchmark for each of the three measurement phases, so you can collect on a different benchmark than the one you measure.

Is that what you are asking about or something else?

Contributor

Ah nice! Curious: should we add an example showing this workflow? cc @Chillee, would this cover the original need for primary_inputs / secondary_inputs?

@yf225 yf225 requested a review from mengluy0125 December 17, 2025 05:31

The AOT workflow consists of three phases:
1. Collect: Run benchmarks, autotuning each shape individually
2. Measure: Re-run benchmarks, measuring all configs across all shapes
Contributor

@yf225 yf225 Dec 17, 2025

(as we discussed, maybe "all shapes" is not exactly accurate, since the user can customize which shapes to run in each phase)

@choijon5 choijon5 requested a review from gmagogsfm December 18, 2025 16:14
"""Represents a unique shape/dtype combination for a kernel."""

kernel_name: str
specialization_key: tuple[Any, ...]
Contributor

So ShapeKey may correspond to a partially specialized shape? If so:

  • Can ShapeKeys collide, i.e. can an input shape match multiple ShapeKeys?
  • How much confidence can we have that the config that is best for a ShapeKey actually works well for an input shape that is wildly different at runtime? Take your tall-skinny / short-wide tensor example: if the first dimension changes by 100x while the second dimension stays the same, the input would jump from tall-skinny to short-wide or vice versa.

Contributor Author

The datastore stores all shape information (shapes, strides, and dtypes) in addition to this key.
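
To make the discussion concrete, here is a hypothetical illustration of how a partially specialized key could be derived by bucketing dimensions; the `make_key` helper and the contents of `specialization_key` are assumptions, and the PR's actual scheme may differ.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ShapeKey:
    """Represents a unique shape/dtype combination for a kernel."""

    kernel_name: str
    specialization_key: tuple[Any, ...]


def next_power_of_2(n: int) -> int:
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def make_key(kernel_name: str, shape: tuple[int, ...], dtype: str) -> ShapeKey:
    # Bucketing dims to powers of two lets nearby shapes share a key,
    # while a 100x change in one dim (tall-skinny -> short-wide) lands
    # in a different bucket and therefore gets its own config.
    buckets = tuple(next_power_of_2(d) for d in shape)
    return ShapeKey(kernel_name, (dtype, buckets))
```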

return model, accuracy, feature_names


def generate_heuristic_code(
Contributor

Is the generated heuristic-based selection code ultimately what users are supposed to call in deployment?

This may help tackle one of the challenges in Helion + vLLM: selecting the best config based on batch_size, which varies per token. Note that the latency bar is fairly high since the selection is triggered per token, so invoking a chunk of (complex?) Python logic might be too slow.

Contributor Author

Yes. This means it is interpretable and can be version-controlled. Depending on how complex it gets, we could compose it with the existing caching logic to cache the heuristic result; a rough sketch follows.
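
A minimal sketch of that composition, assuming the generated heuristic is a pure function of the shape; `select_config` here is a placeholder, not the actual generated name, and the caching shown is generic rather than Helion's existing cache.

```python
import functools


def select_config(batch_size: int) -> str:
    # Placeholder for the generated decision-tree heuristic.
    return "cfg_small" if batch_size <= 64 else "cfg_large"


# Caching the pure heuristic result per shape turns the per-token cost
# into a single dict lookup instead of re-running the tree logic.
cached_select_config = functools.lru_cache(maxsize=None)(select_config)
```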

Contributor Author

@gmagogsfm This is what an example decision tree output might look like: https://gist.github.com/v0i0/d6604662d7095a040ce0db049e192c14

@v0i0 v0i0 force-pushed the v0i0/autotune-heuristic branch from 3f0ced4 to 15f568d Compare December 22, 2025 15:20