Description
Context
I've been exploring the Prompt API for use cases that involve processing multiple independent prompts—things like classifying a batch of emails, summarizing multiple documents, or running the same analysis across a dataset. Currently, the API doesn't seem to offer a way to batch these requests efficiently.
Current Approach
Today, if I want to process multiple prompts, I have to do something like:
```js
const items = ["Classify this email: ...", "Classify this email: ...", /* 50 more */];

const results = await Promise.all(
  items.map(async (prompt) => {
    const session = await LanguageModel.create({
      initialPrompts: [{ role: "system", content: "You are a classifier." }]
    });
    const result = await session.prompt(prompt);
    session.destroy();
    return result;
  })
);
```
Or, using the clone pattern:
```js
const templateSession = await LanguageModel.create({
  initialPrompts: [{ role: "system", content: "You are a classifier." }]
});

const results = await Promise.all(
  items.map(async (prompt) => {
    const session = await templateSession.clone();
    const result = await session.prompt(prompt);
    session.destroy();
    return result;
  })
);
```
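A further wrinkle with both snippets: `Promise.all` fires every request at once, so 50 items means 50 live sessions competing for memory. As a stopgap until something better exists, a small concurrency-limited mapper keeps the workaround usable on constrained devices (this is a hypothetical userland helper, not part of the Prompt API; the limit of 4 below is arbitrary):

```js
// Hypothetical helper: maps async `fn` over `items` while keeping at
// most `limit` calls in flight. Results come back in input order.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // claim the next index (safe: JS is single-threaded)
      results[i] = await fn(items[i], i);
    }
  }
  // Spawn up to `limit` workers that drain the shared queue.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

// Usage with the clone pattern (assumes a `templateSession` as above):
// const results = await mapWithConcurrency(items, 4, async (prompt) => {
//   const session = await templateSession.clone();
//   try { return await session.prompt(prompt); } finally { session.destroy(); }
// });
```

Even with the limit in place, the fundamental problems below remain — this only caps memory pressure, it doesn't let the runtime batch anything.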
The Problem
Both approaches have significant overhead:
- No shared computation — Each session/clone is independent; there's no opportunity for the browser to batch inference operations at the model level
- Session creation overhead — Creating or cloning sessions for each prompt adds latency
- Missed optimization opportunities — Modern inference engines (like vLLM, TensorRT-LLM, etc.) use techniques like continuous batching and KV cache sharing to dramatically improve throughput when processing multiple requests
For on-device models especially, the GPU/NPU could be significantly underutilized when processing prompts one at a time.
Proposal
I'd like to propose a batch inference API that lets developers submit multiple independent prompts for efficient parallel processing.
Option A: Static batch method
```js
const results = await LanguageModel.batchPrompt(
  [
    "Classify: Is this spam? 'You won a million dollars!'",
    "Classify: Is this spam? 'Meeting tomorrow at 3pm'",
    "Classify: Is this spam? 'Click here for free iPhone'"
  ],
  {
    initialPrompts: [{ role: "system", content: "Respond with: spam or not_spam" }],
    temperature: 0.2
  }
);
// results = ["spam", "not_spam", "spam"]
```
Option B: Session-based batch method
```js
const session = await LanguageModel.create({
  initialPrompts: [{ role: "system", content: "Respond with: spam or not_spam" }]
});

const results = await session.batchPrompt([
  "Classify: 'You won a million dollars!'",
  "Classify: 'Meeting tomorrow at 3pm'",
  "Classify: 'Click here for free iPhone'"
]);
```
Option C: Streaming batch with callbacks
For large batches where you want results as they complete:
```js
const batch = session.createBatch([prompt1, prompt2, prompt3, /* ... */]);

batch.onResult((index, result) => {
  console.log(`Prompt ${index} completed: ${result}`);
});

batch.onComplete((allResults) => {
  console.log("All done!", allResults);
});

await batch.start();
```
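To pin down the semantics I have in mind for Option C, here is a userland sketch (everything here is hypothetical — `createBatch` is the proposed API, and `promptFn` stands in for whatever the session would do internally): `onResult` fires per item as it completes, while `onComplete` delivers all results in input order.

```js
// Illustrative sketch of Option C's intended semantics, not an implementation.
// `promptFn(prompt)` is a stand-in for the session's inference call.
function createBatch(prompts, promptFn) {
  const resultHandlers = [];
  const completeHandlers = [];
  return {
    onResult(fn) { resultHandlers.push(fn); },
    onComplete(fn) { completeHandlers.push(fn); },
    async start() {
      const results = await Promise.all(
        prompts.map(async (p, i) => {
          const r = await promptFn(p);
          for (const fn of resultHandlers) fn(i, r); // fires as each item finishes
          return r;
        })
      );
      for (const fn of completeHandlers) fn(results); // input order preserved
      return results;
    }
  };
}
```

A real implementation would of course batch at the model level rather than fan out independent calls; the sketch only nails down the observable callback ordering.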
Implementation Considerations
The browser/runtime could implement this by:
- Continuous batching — Dynamically grouping prompts and interleaving their decoding steps
- KV cache optimization — Sharing prefix computation when prompts share the same system prompt
- Adaptive concurrency — Automatically tuning batch size based on device capabilities and memory constraints
- Graceful degradation — Falling back to sequential processing on devices that don't support batching
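To make the last two points concrete, the policy could be as simple as deriving a batch width from a coarse device signal and collapsing to sequential processing when batching isn't available. A sketch — the thresholds are invented for illustration, and how the runtime detects batching support is left open:

```js
// Illustrative policy only: pick a batch width from a coarse
// device-memory signal (e.g. navigator.deviceMemory in Chrome),
// degrading to sequential (width 1) when batching is unsupported.
// All thresholds are made-up placeholders.
function pickBatchWidth({ deviceMemoryGB = 4, supportsBatching = true } = {}) {
  if (!supportsBatching) return 1; // graceful degradation
  if (deviceMemoryGB >= 8) return 8;
  if (deviceMemoryGB >= 4) return 4;
  return 2;
}
```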
Use Cases
- Bulk classification — Spam detection, sentiment analysis, content moderation
- Data extraction — Parsing structured data from many documents
- Batch transformations — Summarizing, translating, or rewriting multiple items
- Testing/evaluation — Running a model against a test dataset
Prior Art
- vLLM — Continuous batching for high-throughput LLM serving
- TensorRT-LLM — In-flight batching for NVIDIA GPUs
- OpenAI Batch API — Async batch processing for bulk workloads (different context, but similar developer need)