Workflow template engine: missing-key handling for custom/enriched fields
Summary
Workflow templates that reference custom or enriched alert fields (fields not guaranteed by the upstream provider schema) currently crash with RenderException because render_context() always renders with safe=True. There is no way for a workflow author to express a default value for a field that may or may not be present.
Background: two distinct categories of missing fields
Alert fields in Keep come from two different sources, and they need to be handled differently:
1. Provider schema fields
Fields that are part of the upstream service's alert schema — e.g. Grafana's panelUrl, dashboardUrl, valueString; Prometheus's instance, job, value. These fields:
- Have a known, documented shape from the upstream provider
- Should always be present if the provider's _format_alert() is correct
- When absent, it indicates a gap in the provider implementation or a schema change upstream
Correct handling: the provider is responsible. Each provider should explicitly default any field that may be absent in _format_alert(), and the absence should be logged so that provider drift from upstream is visible and can be fixed. This is already the pattern in the codebase and is the right place to enforce schema contracts.
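As a rough illustration of the provider-side pattern described above, the sketch below defaults known schema fields and logs each absence. The helper name apply_schema_defaults and the GRAFANA_SCHEMA_DEFAULTS mapping are hypothetical and not part of Keep; a real provider would do the equivalent inside its own _format_alert().

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical defaults for fields that belong to the upstream Grafana schema.
GRAFANA_SCHEMA_DEFAULTS = {"panelUrl": "", "dashboardUrl": "", "valueString": ""}


def apply_schema_defaults(event: dict, defaults: dict) -> dict:
    """Return a copy of event with missing schema fields defaulted and logged."""
    formatted = dict(event)
    for field, fallback in defaults.items():
        if field not in formatted or formatted[field] is None:
            # Log the gap so drift from the upstream schema stays visible.
            logger.warning("Schema field %r missing from event; defaulting", field)
            formatted[field] = fallback
    return formatted
```

Keeping the defaults in one mapping per provider makes the schema contract easy to audit when the upstream service changes.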
2. Custom and enriched fields
Fields that a workflow author adds via enrich_alert, custom labels, calculated values, or integration-specific enrichments. Examples: slack_timestamp (stored after first Slack post), tenant-specific label mappings, fields from a secondary lookup step. These fields:
- Are not part of any provider schema
- May or may not be present depending on alert history and workflow state
- Are inherently optional from the workflow author's perspective
No provider can default these — they are outside the scope of any upstream service. The workflow template itself is the only appropriate place to handle their optionality.
The problem
render_context() in iohandler.py hardcodes safe=True for all string with: parameters. When Chevron encounters a missing key under safe=True, it warns to stderr and _render() converts that warning into a RenderException, aborting the action.
There is currently no supported syntax in workflow YAML for a workflow author to say: "render this field, but if it's missing use a default value". The keep.* function system cannot help here because it is evaluated after Mustache rendering — by the time keep.default(...) would run, the RenderException has already been thrown.
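The ordering problem can be shown with a toy model. This is not Keep's or Chevron's code: strict_render is a simplified stand-in for a safe=True render, the RenderException class here only mirrors the name used above, and run_pipeline is a hypothetical two-stage pipeline.

```python
import re


class RenderException(Exception):
    """Stand-in for Keep's exception of the same name."""


def strict_render(template: str, context: dict) -> str:
    """Toy stand-in for a safe=True render: any missing key aborts."""

    def resolve(match):
        value = context
        for part in match.group(1).split("."):
            if not isinstance(value, dict) or part not in value:
                raise RenderException(f"missing key: {match.group(1)}")
            value = value[part]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", resolve, template)


def run_pipeline(template: str, context: dict) -> str:
    """Stage 1 renders Mustache tags; stage 2 would evaluate keep.* functions."""
    rendered = strict_render(template, context)
    # Stage 2 would operate on the rendered string here, so it is never
    # reached when stage 1 raises on a missing key.
    return rendered
```

In this model, a template such as keep.default("{{ alert.slack_timestamp }}", "none") fails in stage 1 whenever slack_timestamp is absent, before any defaulting function could run.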
Proposed solution
A two-part approach that keeps the two categories cleanly separated:
Part A — Provider-level defaulting (already the right pattern, no change needed)
Providers continue to own and default their schema fields in _format_alert(). If a provider emits an alert without a field that is part of its upstream schema, the provider implementation should be fixed. This keeps schema contracts explicit and auditable.
Part B — Mustache section helpers in workflow context (new)
Inject a set of named lambda helpers into the Chevron render context for every workflow render. These use the Mustache lambda spec — a callable used as a section tag renders the inner content and can transform or default the result:

    # Empty or missing → "N/A"
    message: "Panel: {{#fn.na}}{{ alert.panelUrl }}{{/fn.na}}"
    # Empty or missing → ""
    message: "Thread: {{#fn.default}}{{ alert.slack_timestamp }}{{/fn.default}}"
    # Transform
    message: "Severity: {{#fn.upper}}{{ alert.severity }}{{/fn.upper}}"

Proposed initial helper set injected globally into the render context:

    WORKFLOW_HELPERS = {
        "fn": {
            # Each helper takes the raw inner template text and a render callable.
            "default": lambda text, render: render(text) or "",
            "na": lambda text, render: render(text) or "N/A",
            "upper": lambda text, render: render(text).upper(),
            "lower": lambda text, render: render(text).lower(),
            "strip": lambda text, render: render(text).strip(),
        }
    }

Each lambda receives the raw inner template text together with a render callable. Calling render(text) resolves the fields first, so the helper operates after field resolution. An absent key renders to "" inside the lambda, because the lambda call path uses safe=False and the missing key is therefore not converted into a RenderException; the lambda then applies the default.
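To make the lambda contract concrete without depending on Chevron, the sketch below calls a subset of the helpers directly. make_render is a hypothetical stand-in for the render callable that Chevron passes to a lambda: it understands only a single {{ key }} tag, looks the key up in a flat dict, and returns "" for an unknown key.

```python
WORKFLOW_HELPERS = {
    "fn": {
        "default": lambda text, render: render(text) or "",
        "na": lambda text, render: render(text) or "N/A",
        "upper": lambda text, render: render(text).upper(),
    }
}


def make_render(values: dict):
    """Build a simplified stand-in for the render callable given to a lambda."""

    def render(text: str) -> str:
        # Strip the surrounding {{ }} and whitespace to get the key name.
        key = text.strip()[2:-2].strip()
        # An unknown key yields "", like a render without the missing-key check.
        return str(values.get(key, ""))

    return render


fn = WORKFLOW_HELPERS["fn"]
render = make_render({"alert.severity": "critical"})
severity = fn["upper"]("{{ alert.severity }}", render)
panel = fn["na"]("{{ alert.panelUrl }}", render)
thread = fn["default"]("{{ alert.slack_timestamp }}", render)
```

Here severity is transformed, while panel and thread fall back to their defaults because neither key exists in the stand-in context.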
Required changes
Backend (keep/iohandler/iohandler.py):
- Define WORKFLOW_HELPERS dict with the lambda set
- Merge it into the context passed to chevron.render() in render_context() and render_recursively()
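One possible shape for the backend change, as a sketch only. The function names build_render_context and uses_helpers are hypothetical, the helper set is reduced to one entry, and two choices are assumptions rather than settled design: helpers take precedence over a user-supplied top-level fn key, and a helper section may contain optional whitespace after the braces.

```python
import re

# Reduced helper set for illustration; the proposal lists the full set.
WORKFLOW_HELPERS = {"fn": {"na": lambda text, render: render(text) or "N/A"}}

# Assumption: a helper section may be written with optional whitespace.
HELPER_SECTION = re.compile(r"\{\{\s*#\s*fn\.")


def build_render_context(context: dict) -> dict:
    """Return a new dict with the helpers merged in; the input is not mutated.

    Assumption: helpers override any existing top-level "fn" key.
    """
    return {**context, **WORKFLOW_HELPERS}


def uses_helpers(template: str) -> bool:
    """Report whether the template opens at least one fn.* section."""
    return HELPER_SECTION.search(template) is not None
```

A caller could use uses_helpers(template) to decide whether a given string is rendered with the missing-key check disabled, leaving every other template on the existing strict path.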
Frontend (keep-ui/entities/workflows/lib/mustache.ts):
- extractMustacheVariables() currently validates every token matched by MUSTACHE_REGEX against ALLOWED_MUSTACHE_VARIABLE_REGEX = /^[a-zA-Z0-9._-\s]+$/
- Section open {{#fn.na}}, section close {{/fn.na}}, and inverted {{^...}} tokens start with #, /, ^ and fail this regex, producing spurious "invalid variable" warnings in the builder
- Fix: filter out tokens that start with #, /, ^, !, > before applying the variable name validation; these are Mustache sigils, not variable references
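The filtering logic is sketched here in Python for illustration only; the actual change belongs in mustache.ts. The MUSTACHE_REGEX pattern is an assumed shape (the real one is not shown in this issue), extract_variables is a hypothetical name, and the hyphen in the allowed-character class is escaped because Python's re module rejects the unescaped form.

```python
import re

# Assumed shape of MUSTACHE_REGEX: capture whatever sits between {{ and }}.
MUSTACHE_REGEX = re.compile(r"\{\{\s*(.*?)\s*\}\}")
# Same character set as the frontend regex, with the hyphen escaped for Python.
ALLOWED_VARIABLE_REGEX = re.compile(r"^[a-zA-Z0-9._\-\s]+$")
# Section open, section close, inverted section, comment, partial.
MUSTACHE_SIGILS = ("#", "/", "^", "!", ">")


def extract_variables(template):
    """Return (valid, invalid) variable names, skipping sigil tokens."""
    valid, invalid = [], []
    for token in MUSTACHE_REGEX.findall(template):
        if token.startswith(MUSTACHE_SIGILS):
            continue  # structural tag, not a variable reference
        if ALLOWED_VARIABLE_REGEX.match(token):
            valid.append(token)
        else:
            invalid.append(token)
    return valid, invalid
```

With the sigil check in place, only genuine variable references reach the name validation, so helper sections no longer trigger warnings while malformed variable names still do.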
What this does not change
- Existing workflows with no fn.* references are unaffected
- Provider _format_alert() defaulting remains the correct place for schema fields
- The keep.* function preprocessor is unaffected
- No changes to deduplication, fingerprinting, or alert storage
Alternatives considered
| Approach | Why not chosen |
|---|---|
| Global safe=False | Silently swallows template typos: {{ alert.naem }} renders "" with no feedback |
| Per-workflow safe: false flag | Authors can opt out of safety with no defaulting syntax; schema change required |
| Pre-render context patch (scan for missing keys, inject "") | Two-pass render complexity; still no per-field default control |
| Jinja2 migration | The default("N/A") filter is the right UX, but it is a breaking migration for all existing workflows |
Status
- Chevron lambda behaviour verified locally (flat and nested dict helpers, warn=False path)
- Backend injection implemented in keep/iohandler/iohandler.py: WORKFLOW_HELPERS merged into Chevron context; safe=False auto-enabled when {{#fn. is detected
- Frontend mustache.ts fix implemented: sigil tokens filtered in extractMustacheVariables()
- Workflow example: {{#fn.default}}{{ alert.silenceURL }}{{/fn.default}} deployed on dev (Neo Grafana → Slack workflow, revision 4)
- Validated end-to-end on dev environment (all five helpers: fn.default, fn.na, fn.upper, fn.lower, fn.strip; exercised by real alert firing on dev Keep 0.49.0-svt.10)