Model Ensembles: Several Low-Cost LLMs, One Better Answer

June 17, 2026

Products

Picking one model for a hard prompt is a gamble. The frontier model is accurate but expensive and slow; the cheap model is fast but occasionally confidently wrong. Most teams pick one and live with the trade-off.

A model ensemble removes the choice. Today we're introducing model ensembles in AISIX AI Gateway — a new virtual-model type that asks several models the same question in parallel and has a judge model reconcile their answers into one. You call it like any other model; the gateway does the rest.

What is a model ensemble?

A model ensemble is a single, caller-visible model name that fans one /v1/chat/completions request out to a configured panel of models in parallel, then uses a judge model to synthesize their answers into one final response. The fan-out and synthesis happen entirely server-side in the gateway — your client sends one request and gets one answer back, through the same OpenAI-compatible API you already use.

TermWhat it is
PanelThe set of models that independently answer the prompt, in parallel. Two or more, any mix of providers.
JudgeA model that reads the panel's answers and writes the single final answer the caller receives.
Ensemble modelThe caller-visible alias that wires a panel + a judge together. Clients call it by name; they never see the panel.

It's the third virtual-model kind in AISIX, alongside direct models (one alias → one upstream) and routing models (one alias → one target chosen by a strategy). The difference: a routing model picks one model per request; an ensemble model calls all of them and combines the result. It's the gateway acting as an intelligent orchestrator, not just a proxy.

Why several cheap models can beat one expensive one

The idea behind ensembling is simple: independent answers that agree are more likely correct, and a judge can resolve the cases where they disagree — keeping the well-supported insight one model found and discarding the claim another hallucinated. On hard reasoning and research-style prompts, a panel of smaller, lower-cost models reconciled by a judge can approach the answer quality you'd otherwise pay a single frontier model for.

That reframes a procurement decision as a configuration decision:

  • Reduce single-model variance and blind spots. No single model is the bottleneck. One model's bad day is outvoted by the others and caught by the judge.
  • Cross-check hard tasks. Different models reach answers by different paths; consensus across them is a stronger signal than any one chain of thought.
  • Spend where it counts. Run three inexpensive models plus a cheap judge instead of one premium model — same OpenAI-compatible call, a different cost/quality point.

It is not free. An N-member ensemble makes N + 1 upstream calls per request, so it costs more and has higher latency than a single direct model — time-to-first-token is roughly max(panel latency) + judge time. Ensembles are for high-value, latency-tolerant traffic where answer quality matters more than the last few cents or milliseconds — not for autocomplete or chat typeahead.

Self-ensemble: diversity from one provider key

You don't need contracts with multiple vendors to benefit. A self-ensemble uses the same model several times with different sampling settings — each panel member gets its own temperature and seed — so one provider key produces a diverse panel that the judge still reconciles. It's the lowest-friction way to try ensembling: point the panel at a model you already use, vary the sampling, add a judge.

How it works

  1. A client sends one request to the ensemble's model name — an ordinary OpenAI-compatible chat call.
  2. The gateway dispatches the prompt to every panel member concurrently, applying each member's own temperature/seed.
  3. Once enough members answer (see min_responses), the gateway builds a synthesis prompt from the original request plus the panel's answers and calls the judge.
  4. The caller receives the judge's single synthesized answer — under the ensemble's name, never a panel member's.

Two contracts matter for anyone wiring this into an app or an agent:

  • Customer-facing identity is preserved. response.model is always the ensemble alias — never a panel member, the judge, or an upstream provider's raw model id. The panel composition is an implementation detail; the synthesized answer never names the models that produced it.
  • Usage is the real, aggregate cost. The response usage object is the sum of every panel call plus the judge — so the cost you see reflects the whole fan-out (its prompt_tokens will intentionally exceed what you sent). For deeper analysis, the gateway also emits one usage event per sub-call, labeled panel or judge, so the per-request breakdown shows up in your logs.

Streaming works too: the panel is collected server-side, then the judge's tokens stream back to the caller — so you keep a streaming UX on top of the fan-out.

Configure it in a few lines

An ensemble references existing direct models by name — a panel, a judge, and a couple of knobs. Via the AISIX Admin API:

{ "display_name": "council-of-three", "ensemble": { "panel": [ { "model": "kimi-k2", "temperature": 0.7 }, { "model": "deepseek-v3", "temperature": 0.7 }, { "model": "gemini-flash", "temperature": 0.9 } ], "judge": { "model": "deepseek-v3" }, "min_responses": 2, "timeout_ms": 45000 } }
FieldWhat it controls
panel[].modelA direct model the panel calls (repeat the same one for a self-ensemble).
panel[].temperature / seedPer-member sampling — the diversity source; overrides the request's temperature.
judge.modelThe model that synthesizes the final answer. Runs at a fixed low temperature (~0.2) for stable output; an optional synthesis_prompt overrides the default template.
min_responsesHow many panel answers are required before the judge runs. The request tolerates a slow or failing member as long as this many succeed.
timeout_msA per-call deadline for each panel member and the judge.

In AISIX Cloud, you build the same thing from the dashboard's Models page — a panel picker, a judge selector, and the per-member knobs — and each panel member and the judge keep their own rate limits, so an ensemble can't quietly burn a provider's quota N× faster than your configured caps. Callers then use council-of-three exactly like any other model name.

What ships in v1

Model ensembles are available now — on AISIX Cloud and in the open-source AISIX gateway. A few deliberate boundaries in this first release:

  • Chat completions only. Ensembles run on /v1/chat/completions. A request that carries tools (or a forced tool_choice) is rejected with a 400 — broadcasting one tool call to N models yields N conflicting results a client can't reconcile. Use a direct or routing model for tool-using requests.
  • Direct models only. Panel members and the judge must be direct models in the same environment; ensembles don't nest inside ensembles or routing groups.
  • Config-driven. The panel is curated by the operator; there's no per-request panel override. Clients select an ensemble only by its model name.

Frequently asked questions

What is a model ensemble in an AI gateway?

It's a virtual model that sends one prompt to several LLMs at once (the panel) and uses another LLM (the judge) to merge their answers into one. The gateway runs the fan-out and synthesis server-side, so the client makes a single OpenAI-compatible request and receives a single answer.

Does an ensemble cost more than calling one model?

Yes. An N-member ensemble makes N + 1 upstream calls (the panel plus the judge), so it costs more and adds latency. The gateway reports the aggregate usage so the cost is transparent. Use ensembles for high-value, latency-tolerant prompts, not high-volume cheap traffic.

Can I run an ensemble with a single provider key?

Yes — that's a self-ensemble. Point the panel at the same model several times with different temperature/seed values. You get a diverse panel and a judge-reconciled answer without onboarding new vendors.

How is token usage and cost reported for an ensemble?

The client-facing usage object is the sum of every panel call plus the judge, so prompt_tokens and total_tokens reflect the full fan-out. The gateway also emits one usage event per sub-call (labeled panel or judge) for the per-request breakdown in your logs.

Is an ensemble model OpenAI-compatible?

Yes. You call it by its model name through the standard /v1/chat/completions API and get a standard chat completion back — streaming included. No SDK changes; the panel and judge are invisible to the caller.

How is an ensemble different from model routing or failover?

Routing and failover pick one target per request (a primary, or a fallback when the primary fails). An ensemble calls every panel member and synthesizes their answers into one. Routing optimizes for resilience and traffic shaping; ensembling optimizes for answer quality.

Get started

Model ensembles are part of AISIX AI Gateway. Define a panel and a judge, point your existing OpenAI-compatible client at the ensemble's model name, and compare the synthesized answer to your current single-model setup on your hardest prompts.

Further reading

Tags: