Goal: Implement a Kubernetes operator that watches Experiment Custom Resources and reconciles them into fully functional Testkube TestWorkflows executing the 4-phase evaluation pipeline (setup → run → evaluate → publish).
Description:
Today, running the testbench evaluation pipeline in Kubernetes requires manually creating a ConfigMap with the experiment definition, a TestWorkflow chaining the phase templates, and optionally a TestTrigger for automatic execution on agent changes (see `deploy/local/`). This story introduces a testbench-operator — analogous to the agent-runtime-operator — that automates this entire process. Users define an `Experiment` CR with their evaluation configuration, and the operator generates all required Kubernetes resources.

Related: #21

Key Deliverables

1. Experiment CRD Specification
Define an Experiment Custom Resource that captures all evaluation parameters. The CRD supports two data source modes: inline scenarios (defined directly in the spec) and external datasets (loaded from S3 or other sources via the setup phase).
```yaml
apiVersion: testbench.agentic-layer.ai/v1alpha1
kind: Experiment
metadata:
  name: weather-agent-evaluation
  namespace: testkube
spec:
  # Reference to the agent under test
  agentRef:
    name: weather-agent
    namespace: sample-agents
  # External dataset source (triggers setup phase)
  dataset:
    s3:
      bucket: evaluation-datasets
      key: weather-agent/dataset.csv
    # Alternative: url: "http://data-server:8000/dataset.csv"
  # LLM configuration for evaluation
  llmAsAJudgeModel: "gemini-2.5-flash-lite"
  defaultThreshold: 0.9
  # Inline scenario definitions (alternative to dataset, skips setup phase)
  scenarios:
    - name: "Weather Query - New York"
      steps:
        - input: "What is the weather in New York?"
          reference:
            toolCalls:
              - name: get_weather
                args:
                  city: "New York"
      metrics:
        - metricName: AgentGoalAccuracyWithoutReference
        - metricName: ToolCallAccuracy
        - metricName: TopicAdherence
          threshold: 0.8
          parameters:
            mode: precision
  # Automatic trigger configuration
  trigger:
    enabled: true
    event: modified  # Trigger on agent deployment changes
    concurrencyPolicy: allow
```
Key design decisions:
- `agentRef` references an `Agent` CR (resolved to its A2A endpoint by the operator)
- `dataset` and `scenarios` are mutually exclusive — if `dataset` is set, the setup phase runs; if `scenarios` is set inline, the operator generates the `experiment.json` ConfigMap directly
- `trigger` controls automatic TestTrigger creation, watching the referenced agent's Deployment
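The issue doesn't pin down the Go API types, but the YAML example translates naturally into Operator SDK scaffolding. A minimal sketch, with the mutual-exclusivity rule expressed as a kubebuilder CEL marker; all type and field names below are inferred from the example above, not taken from existing code:

```go
package v1alpha1

// ExperimentSpec mirrors the YAML example above. The XValidation marker
// enforces that dataset and scenarios are mutually exclusive via a CEL rule
// evaluated by the API server.
// +kubebuilder:validation:XValidation:rule="!(has(self.dataset) && has(self.scenarios))",message="dataset and scenarios are mutually exclusive"
type ExperimentSpec struct {
	// AgentRef names the Agent CR under test; the operator resolves it to an A2A endpoint.
	AgentRef ObjectRef `json:"agentRef"`

	// Dataset enables the setup phase by pointing at an external data source.
	// +optional
	Dataset *DatasetSource `json:"dataset,omitempty"`

	// Scenarios defines evaluation cases inline (skips the setup phase).
	// +optional
	Scenarios []Scenario `json:"scenarios,omitempty"`

	// +optional
	LLMAsAJudgeModel string `json:"llmAsAJudgeModel,omitempty"`

	// DefaultThreshold applies to metrics that set no threshold of their own.
	// Note: float fields require controller-gen's crd:allowDangerousTypes=true.
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=1
	// +optional
	DefaultThreshold *float64 `json:"defaultThreshold,omitempty"`

	// Trigger controls automatic TestTrigger creation.
	// +optional
	Trigger *TriggerSpec `json:"trigger,omitempty"`
}

type ObjectRef struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
}

type DatasetSource struct {
	// +optional
	S3 *S3Source `json:"s3,omitempty"`
	// +optional
	URL string `json:"url,omitempty"`
}

type S3Source struct {
	Bucket string `json:"bucket"`
	Key    string `json:"key"`
}

// Scenario, Step, and Metric follow the shape of the YAML example.
type Scenario struct {
	Name    string   `json:"name"`
	Steps   []Step   `json:"steps"`
	Metrics []Metric `json:"metrics,omitempty"`
}

type Step struct {
	Input     string     `json:"input"`
	Reference *Reference `json:"reference,omitempty"`
}

type Reference struct {
	ToolCalls []ToolCall `json:"toolCalls,omitempty"`
}

type ToolCall struct {
	Name string            `json:"name"`
	Args map[string]string `json:"args,omitempty"`
}

type Metric struct {
	MetricName string            `json:"metricName"`
	Threshold  *float64          `json:"threshold,omitempty"`
	Parameters map[string]string `json:"parameters,omitempty"`
}

type TriggerSpec struct {
	Enabled           bool   `json:"enabled"`
	Event             string `json:"event,omitempty"`             // e.g. "modified"
	ConcurrencyPolicy string `json:"concurrencyPolicy,omitempty"` // e.g. "allow"
}
```

Expressing the rule as CEL keeps validation in the API server and avoids running a webhook, which is why the acceptance criteria below allow either option.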
2. Operator Implementation (Go + Operator SDK)
Built with Go and Operator SDK, consistent with agent-runtime-operator.
Reconciliation logic — on each `Experiment` CR change, the operator ensures:

- a ConfigMap (`{name}-experiment`) containing the rendered `experiment.json`
- a TestWorkflow (`{name}-workflow`) chaining the phase templates
- a TestTrigger (`{name}-trigger`) when `spec.trigger.enabled: true`
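A condensed sketch of that flow, assuming the usual controller-runtime shape; the `ensure*` helpers and the module path are hypothetical stand-ins:

```go
package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Hypothetical module path for the sketched API types.
	testbenchv1alpha1 "github.com/agentic-layer/testbench-operator/api/v1alpha1"
)

type ExperimentReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *ExperimentReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var exp testbenchv1alpha1.Experiment
	if err := r.Get(ctx, req.NamespacedName, &exp); err != nil {
		// CR deleted: owned resources are cleaned up by garbage collection (see section 4).
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ConfigMap {name}-experiment carrying experiment.json
	if err := r.ensureConfigMap(ctx, &exp); err != nil {
		return ctrl.Result{}, err
	}
	// TestWorkflow {name}-workflow chaining the phase templates
	if err := r.ensureWorkflow(ctx, &exp); err != nil {
		return ctrl.Result{}, err
	}
	// TestTrigger {name}-trigger, only when spec.trigger.enabled is true
	if exp.Spec.Trigger != nil && exp.Spec.Trigger.Enabled {
		if err := r.ensureTrigger(ctx, &exp); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```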
Phase template selection:

- `spec.dataset` is set → include `setup-template` as first phase (downloads external dataset)
- `spec.scenarios` is set → skip setup, inject `experiment.json` via ConfigMap
- every workflow then chains `run-template`, `evaluate-template`, `publish-template`, `visualize-template`
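As a sketch, this selection reduces to a small helper over the spec (template names are the ones listed above):

```go
// phaseTemplates returns the ordered template chain for an Experiment.
func phaseTemplates(spec testbenchv1alpha1.ExperimentSpec) []string {
	templates := []string{}
	if spec.Dataset != nil {
		// External dataset: prepend the setup phase that downloads it.
		templates = append(templates, "setup-template")
	}
	// Inline scenarios skip setup; experiment.json is injected via ConfigMap.
	return append(templates,
		"run-template", "evaluate-template", "publish-template", "visualize-template")
}
```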
Config parameter mapping:

- `spec.agentRef` → resolved to agent URL → `run-template.config.agentUrl`
- `spec.dataset.s3` → `setup-template.config.bucket` + `setup-template.config.key`
- `spec.llmAsAJudgeModel` → embedded in `experiment.json`
- OpenTelemetry settings (`otel-config`)
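A sketch of that mapping as a config dictionary keyed by template, matching the `config` parameters named above; `agentURL` is assumed to come from resolving `agentRef`:

```go
// templateConfig maps Experiment spec fields onto per-template config values.
func templateConfig(exp *testbenchv1alpha1.Experiment, agentURL string) map[string]map[string]string {
	cfg := map[string]map[string]string{
		// spec.agentRef, resolved to the agent's A2A endpoint by the operator
		"run-template": {"agentUrl": agentURL},
	}
	if ds := exp.Spec.Dataset; ds != nil && ds.S3 != nil {
		cfg["setup-template"] = map[string]string{
			"bucket": ds.S3.Bucket,
			"key":    ds.S3.Key,
		}
	}
	// spec.llmAsAJudgeModel is not template config; it is embedded in experiment.json.
	return cfg
}
```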
3. Status Reporting

The `Experiment` CR status subresource reports reconciliation and workflow state.
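The concrete status fields are not enumerated here; one plausible shape, assuming standard `metav1.Condition`s plus a record of the latest workflow execution (all field names are assumptions):

```go
// In the same sketched v1alpha1 package; metav1 is k8s.io/apimachinery/pkg/apis/meta/v1.
type ExperimentStatus struct {
	// Conditions reports reconciliation state (e.g. Ready, ResourcesCreated).
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`

	// WorkflowPhase summarizes the latest TestWorkflow execution
	// (e.g. Queued, Running, Passed, Failed).
	// +optional
	WorkflowPhase string `json:"workflowPhase,omitempty"`

	// LatestExecution names the most recent TestWorkflowExecution observed.
	// +optional
	LatestExecution string `json:"latestExecution,omitempty"`
}
```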
4. Garbage Collection & Ownership

- All generated resources carry `ownerReferences` pointing to the `Experiment` CR
- Deleting the `Experiment` CR cascades deletion to ConfigMap, TestWorkflow, and TestTrigger
- Updates to the `Experiment` CR trigger re-reconciliation (update existing resources)
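A sketch of how a single owned resource could be managed with controller-runtime's `controllerutil`, covering both the ownership and the update-on-change behavior; `renderExperimentJSON` is a hypothetical helper:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// ensureConfigMap creates or updates the {name}-experiment ConfigMap and ties
// its lifetime to the Experiment CR: the ownerReference set here is what makes
// deletion cascade, and CreateOrUpdate handles the re-reconciliation path.
func (r *ExperimentReconciler) ensureConfigMap(ctx context.Context, exp *testbenchv1alpha1.Experiment) error {
	cm := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{
		Name:      exp.Name + "-experiment",
		Namespace: exp.Namespace,
	}}
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, cm, func() error {
		cm.Data = map[string]string{"experiment.json": renderExperimentJSON(exp)} // hypothetical renderer
		return controllerutil.SetControllerReference(exp, cm, r.Scheme)
	})
	return err
}
```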
5. Helm Chart Integration

Acceptance Criteria
CRD & Validation
- `Experiment` CRD schema defined with OpenAPI validation
- `dataset` and `scenarios` are mutually exclusive (validation webhook or CEL rule)
- `agentRef` must reference an existing Agent CR in the specified namespace
- `defaultThreshold` validated to 0-1 range
- `threshold` validated to 0-1 range

Reconciliation
- Creating an `Experiment` CR generates ConfigMap + TestWorkflow + TestTrigger
- Updating an `Experiment` CR updates generated resources accordingly
- Deleting an `Experiment` CR cascades deletion via ownerReferences
- `agentRef` resolved to the agent endpoint (Agent CR status or service DNS)

Trigger
- TestTrigger created when `spec.trigger.enabled: true`
- No TestTrigger created when `spec.trigger.enabled: false` or omitted
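One way to generate that trigger without importing Testkube's Go API is to build it as an unstructured object; the spec fields below mirror the manual `deploy/local/example-workflow-trigger.yaml` and are assumptions to be checked against that file:

```go
import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// buildTrigger assembles the {name}-trigger watching the referenced agent's
// Deployment and firing the generated TestWorkflow.
func buildTrigger(exp *testbenchv1alpha1.Experiment) *unstructured.Unstructured {
	t := &unstructured.Unstructured{}
	t.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "tests.testkube.io", Version: "v1", Kind: "TestTrigger",
	})
	t.SetName(exp.Name + "-trigger")
	t.SetNamespace(exp.Namespace)
	t.Object["spec"] = map[string]any{
		"resource": "deployment",
		"resourceSelector": map[string]any{ // the referenced agent's Deployment
			"name":      exp.Spec.AgentRef.Name,
			"namespace": exp.Spec.AgentRef.Namespace,
		},
		"event":             exp.Spec.Trigger.Event, // e.g. "modified"
		"action":            "run",
		"execution":         "testworkflow",
		"testSelector":      map[string]any{"name": exp.Name + "-workflow", "namespace": exp.Namespace},
		"concurrencyPolicy": exp.Spec.Trigger.ConcurrencyPolicy,
	}
	return t
}
```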
Status

Testing & Quality
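A minimal envtest-style check, assuming the Ginkgo/Gomega suite that Operator SDK scaffolds (with `k8sClient` and `ctx` coming from `suite_test.go`, and `Experiment` being the usual scaffolded wrapper around the sketched spec):

```go
var _ = Describe("Experiment controller", func() {
	It("creates the experiment ConfigMap", func() {
		exp := &testbenchv1alpha1.Experiment{
			ObjectMeta: metav1.ObjectMeta{Name: "weather-agent-evaluation", Namespace: "testkube"},
			Spec: testbenchv1alpha1.ExperimentSpec{
				AgentRef: testbenchv1alpha1.ObjectRef{Name: "weather-agent", Namespace: "sample-agents"},
			},
		}
		Expect(k8sClient.Create(ctx, exp)).To(Succeed())

		// The reconciler should materialize the {name}-experiment ConfigMap.
		cm := &corev1.ConfigMap{}
		Eventually(func() error {
			return k8sClient.Get(ctx, types.NamespacedName{
				Name: "weather-agent-evaluation-experiment", Namespace: "testkube",
			}, cm)
		}).Should(Succeed())
		Expect(cm.Data).To(HaveKey("experiment.json"))
	})
})
```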
Deployment
Implementation Status
References
- `deploy/local/example-workflow.yaml` — Current manual TestWorkflow definition
- `deploy/local/experiment.yaml` — Current manual experiment ConfigMap
- `deploy/local/example-workflow-trigger.yaml` — Current manual TestTrigger
- `chart/templates/` — Existing TestWorkflowTemplate Helm templates

Sub-Tasks
Supersedes
Testbench Operator #21 (closed, fully covered by this issue)