Model Ensembles: Several Low-Cost LLMs, One Better Answer
June 17, 2026
Picking one model for a hard prompt is a gamble. The frontier model is accurate but expensive and slow; the cheap model is fast but occasionally confidently wrong. Most teams pick one and live with the trade-off.
A model ensemble removes the choice. Today we're introducing model ensembles in AISIX AI Gateway — a new virtual-model type that asks several models the same question in parallel and has a judge model reconcile their answers into one. You call it like any other model; the gateway does the rest.
What is a model ensemble?
A model ensemble is a single, caller-visible model name that fans one /v1/chat/completions request out to a configured panel of models in parallel, then uses a judge model to synthesize their answers into one final response. The fan-out and synthesis happen entirely server-side in the gateway — your client sends one request and gets one answer back, through the same OpenAI-compatible API you already use.
| Term | What it is |
|---|---|
| Panel | The set of models that independently answer the prompt, in parallel. Two or more, any mix of providers. |
| Judge | A model that reads the panel's answers and writes the single final answer the caller receives. |
| Ensemble model | The caller-visible alias that wires a panel + a judge together. Clients call it by name; they never see the panel. |
It's the third virtual-model kind in AISIX, alongside direct models (one alias → one upstream) and routing models (one alias → one target chosen by a strategy). The difference: a routing model picks one model per request; an ensemble model calls all of them and combines the result. It's the gateway acting as an intelligent orchestrator, not just a proxy.
Why several cheap models can beat one expensive one
The idea behind ensembling is simple: independent answers that agree are more likely correct, and a judge can resolve the cases where they disagree — keeping the well-supported insight one model found and discarding the claim another hallucinated. On hard reasoning and research-style prompts, a panel of smaller, lower-cost models reconciled by a judge can approach the answer quality you'd otherwise pay a single frontier model for.
That reframes a procurement decision as a configuration decision:
- Reduce single-model variance and blind spots. No single model is the bottleneck. One model's bad day is outvoted by the others and caught by the judge.
- Cross-check hard tasks. Different models reach answers by different paths; consensus across them is a stronger signal than any one chain of thought.
- Spend where it counts. Run three inexpensive models plus a cheap judge instead of one premium model — same OpenAI-compatible call, a different cost/quality point.
It is not free. An N-member ensemble makes N + 1 upstream calls per request, so it costs more and has higher latency than a single direct model — time-to-first-token is roughly max(panel latency) + judge time. Ensembles are for high-value, latency-tolerant traffic where answer quality matters more than the last few cents or milliseconds — not for autocomplete or chat typeahead.
Self-ensemble: diversity from one provider key
You don't need contracts with multiple vendors to benefit. A self-ensemble uses the same model several times with different sampling settings — each panel member gets its own temperature and seed — so one provider key produces a diverse panel that the judge still reconciles. It's the lowest-friction way to try ensembling: point the panel at a model you already use, vary the sampling, add a judge.
How it works
- A client sends one request to the ensemble's model name — an ordinary OpenAI-compatible chat call.
- The gateway dispatches the prompt to every panel member concurrently, applying each member's own
temperature/seed. - Once enough members answer (see
min_responses), the gateway builds a synthesis prompt from the original request plus the panel's answers and calls the judge. - The caller receives the judge's single synthesized answer — under the ensemble's name, never a panel member's.
Two contracts matter for anyone wiring this into an app or an agent:
- Customer-facing identity is preserved.
response.modelis always the ensemble alias — never a panel member, the judge, or an upstream provider's raw model id. The panel composition is an implementation detail; the synthesized answer never names the models that produced it. - Usage is the real, aggregate cost. The response
usageobject is the sum of every panel call plus the judge — so the cost you see reflects the whole fan-out (itsprompt_tokenswill intentionally exceed what you sent). For deeper analysis, the gateway also emits one usage event per sub-call, labeledpanelorjudge, so the per-request breakdown shows up in your logs.
Streaming works too: the panel is collected server-side, then the judge's tokens stream back to the caller — so you keep a streaming UX on top of the fan-out.
Configure it in a few lines
An ensemble references existing direct models by name — a panel, a judge, and a couple of knobs. Via the AISIX Admin API:
{ "display_name": "council-of-three", "ensemble": { "panel": [ { "model": "kimi-k2", "temperature": 0.7 }, { "model": "deepseek-v3", "temperature": 0.7 }, { "model": "gemini-flash", "temperature": 0.9 } ], "judge": { "model": "deepseek-v3" }, "min_responses": 2, "timeout_ms": 45000 } }
| Field | What it controls |
|---|---|
panel[].model | A direct model the panel calls (repeat the same one for a self-ensemble). |
panel[].temperature / seed | Per-member sampling — the diversity source; overrides the request's temperature. |
judge.model | The model that synthesizes the final answer. Runs at a fixed low temperature (~0.2) for stable output; an optional synthesis_prompt overrides the default template. |
min_responses | How many panel answers are required before the judge runs. The request tolerates a slow or failing member as long as this many succeed. |
timeout_ms | A per-call deadline for each panel member and the judge. |
In AISIX Cloud, you build the same thing from the dashboard's Models page — a panel picker, a judge selector, and the per-member knobs — and each panel member and the judge keep their own rate limits, so an ensemble can't quietly burn a provider's quota N× faster than your configured caps. Callers then use council-of-three exactly like any other model name.
What ships in v1
Model ensembles are available now — on AISIX Cloud and in the open-source AISIX gateway. A few deliberate boundaries in this first release:
- Chat completions only. Ensembles run on
/v1/chat/completions. A request that carriestools(or a forcedtool_choice) is rejected with a400— broadcasting one tool call to N models yields N conflicting results a client can't reconcile. Use a direct or routing model for tool-using requests. - Direct models only. Panel members and the judge must be direct models in the same environment; ensembles don't nest inside ensembles or routing groups.
- Config-driven. The panel is curated by the operator; there's no per-request panel override. Clients select an ensemble only by its model name.
Frequently asked questions
What is a model ensemble in an AI gateway?
It's a virtual model that sends one prompt to several LLMs at once (the panel) and uses another LLM (the judge) to merge their answers into one. The gateway runs the fan-out and synthesis server-side, so the client makes a single OpenAI-compatible request and receives a single answer.
Does an ensemble cost more than calling one model?
Yes. An N-member ensemble makes N + 1 upstream calls (the panel plus the judge), so it costs more and adds latency. The gateway reports the aggregate usage so the cost is transparent. Use ensembles for high-value, latency-tolerant prompts, not high-volume cheap traffic.
Can I run an ensemble with a single provider key?
Yes — that's a self-ensemble. Point the panel at the same model several times with different temperature/seed values. You get a diverse panel and a judge-reconciled answer without onboarding new vendors.
How is token usage and cost reported for an ensemble?
The client-facing usage object is the sum of every panel call plus the judge, so prompt_tokens and total_tokens reflect the full fan-out. The gateway also emits one usage event per sub-call (labeled panel or judge) for the per-request breakdown in your logs.
Is an ensemble model OpenAI-compatible?
Yes. You call it by its model name through the standard /v1/chat/completions API and get a standard chat completion back — streaming included. No SDK changes; the panel and judge are invisible to the caller.
How is an ensemble different from model routing or failover?
Routing and failover pick one target per request (a primary, or a fallback when the primary fails). An ensemble calls every panel member and synthesizes their answers into one. Routing optimizes for resilience and traffic shaping; ensembling optimizes for answer quality.
Get started
Model ensembles are part of AISIX AI Gateway. Define a panel and a judge, point your existing OpenAI-compatible client at the ensemble's model name, and compare the synthesized answer to your current single-model setup on your hardest prompts.
Further reading
- Model Ensembles configuration guide — fields, synthesis, streaming, and constraints (in the open-source AISIX repo)
- AI Gateway product page
- Announcing AISIX: The AI-Native AI Gateway
- The Future of AI Gateways: From Proxy to Intelligent Orchestrator