Model Ensembles: Several Low-Cost LLMs, One Better Answer

Picking one model for a hard prompt is a gamble. The frontier model is accurate but expensive and slow; the cheap model is fast but occasionally confidently wrong. Most teams pick one and live with the trade-off.

A model ensemble removes the choice. Today we're introducing model ensembles in AISIX AI Gateway — a new virtual-model type that asks several models the same question in parallel and has a judge model reconcile their answers into one. You call it like any other model; the gateway does the rest.

What is a model ensemble?

A model ensemble is a single, caller-visible model name that fans one /v1/chat/completions request out to a configured panel of models in parallel, then uses a judge model to synthesize their answers into one final response. The fan-out and synthesis happen entirely server-side in the gateway — your client sends one request and gets one answer back, through the same OpenAI-compatible API you already use.

Term	What it is
Panel	The set of models that independently answer the prompt, in parallel. Two or more, any mix of providers.
Judge	A model that reads the panel's answers and writes the single final answer the caller receives.
Ensemble model	The caller-visible alias that wires a panel + a judge together. Clients call it by name; they never see the panel.

It's the third virtual-model kind in AISIX, alongside direct models (one alias → one upstream) and routing models (one alias → one target chosen by a strategy). The difference: a routing model picks one model per request; an ensemble model calls all of them and combines the result. It's the gateway acting as an intelligent orchestrator, not just a proxy.

Why several cheap models can beat one expensive one

The idea behind ensembling is simple: independent answers that agree are more likely correct, and a judge can resolve the cases where they disagree — keeping the well-supported insight one model found and discarding the claim another hallucinated. On hard reasoning and research-style prompts, a panel of smaller, lower-cost models reconciled by a judge can approach the answer quality you'd otherwise pay a single frontier model for.

That reframes a procurement decision as a configuration decision:

Reduce single-model variance and blind spots. No single model is the bottleneck. One model's bad day is outvoted by the others and caught by the judge.
Cross-check hard tasks. Different models reach answers by different paths; consensus across them is a stronger signal than any one chain of thought.
Spend where it counts. Run three inexpensive models plus a cheap judge instead of one premium model — same OpenAI-compatible call, a different cost/quality point.

It is not free. An N-member ensemble makes N + 1 upstream calls per request, so it costs more and has higher latency than a single direct model — time-to-first-token is roughly max(panel latency) + judge time. Ensembles are for high-value, latency-tolerant traffic where answer quality matters more than the last few cents or milliseconds — not for autocomplete or chat typeahead.

Self-ensemble: diversity from one provider key

You don't need contracts with multiple vendors to benefit. A self-ensemble uses the same model several times with different sampling settings — each panel member gets its own temperature and seed — so one provider key produces a diverse panel that the judge still reconciles. It's the lowest-friction way to try ensembling: point the panel at a model you already use, vary the sampling, add a judge.

How it works

A client sends one request to the ensemble's model name — an ordinary OpenAI-compatible chat call.
The gateway dispatches the prompt to every panel member concurrently, applying each member's own temperature/seed.
Once enough members answer (see min_responses), the gateway builds a synthesis prompt from the original request plus the panel's answers and calls the judge.
The caller receives the judge's single synthesized answer — under the ensemble's name, never a panel member's.

Two contracts matter for anyone wiring this into an app or an agent:

Customer-facing identity is preserved. response.model is always the ensemble alias — never a panel member, the judge, or an upstream provider's raw model id. The panel composition is an implementation detail; the synthesized answer never names the models that produced it.
Usage is the real, aggregate cost. The response usage object is the sum of every panel call plus the judge — so the cost you see reflects the whole fan-out (its prompt_tokens will intentionally exceed what you sent). For deeper analysis, the gateway also emits one usage event per sub-call, labeled panel or judge, so the per-request breakdown shows up in your logs.

Streaming works too: the panel is collected server-side, then the judge's tokens stream back to the caller — so you keep a streaming UX on top of the fan-out.

Configure it in a few lines

An ensemble references existing direct models by name — a panel, a judge, and a couple of knobs. Via the AISIX Admin API:

{
  "display_name": "council-of-three",
  "ensemble": {
    "panel": [
      { "model": "kimi-k2",      "temperature": 0.7 },
      { "model": "deepseek-v3",  "temperature": 0.7 },
      { "model": "gemini-flash", "temperature": 0.9 }
    ],
    "judge": { "model": "deepseek-v3" },
    "min_responses": 2,
    "timeout_ms": 45000
  }
}

Field	What it controls
`panel[].model`	A direct model the panel calls (repeat the same one for a self-ensemble).
`panel[].temperature` / `seed`	Per-member sampling — the diversity source; overrides the request's temperature.
`judge.model`	The model that synthesizes the final answer. Runs at a fixed low temperature (~0.2) for stable output; an optional `synthesis_prompt` overrides the default template.
`min_responses`	How many panel answers are required before the judge runs. The request tolerates a slow or failing member as long as this many succeed.
`timeout_ms`	A per-call deadline for each panel member and the judge.

In AISIX Cloud, you build the same thing from the dashboard's Models page — a panel picker, a judge selector, and the per-member knobs — and each panel member and the judge keep their own rate limits, so an ensemble can't quietly burn a provider's quota N× faster than your configured caps. Callers then use council-of-three exactly like any other model name.

What ships in v1

Model ensembles are available now — on AISIX Cloud and in the open-source AISIX gateway. A few deliberate boundaries in this first release:

Chat completions only. Ensembles run on /v1/chat/completions. A request that carries tools (or a forced tool_choice) is rejected with a 400 — broadcasting one tool call to N models yields N conflicting results a client can't reconcile. Use a direct or routing model for tool-using requests.
Direct models only. Panel members and the judge must be direct models in the same environment; ensembles don't nest inside ensembles or routing groups.
Config-driven. The panel is curated by the operator; there's no per-request panel override. Clients select an ensemble only by its model name.

Frequently asked questions

What is a model ensemble in an AI gateway?

It's a virtual model that sends one prompt to several LLMs at once (the panel) and uses another LLM (the judge) to merge their answers into one. The gateway runs the fan-out and synthesis server-side, so the client makes a single OpenAI-compatible request and receives a single answer.

Does an ensemble cost more than calling one model?

Yes. An N-member ensemble makes N + 1 upstream calls (the panel plus the judge), so it costs more and adds latency. The gateway reports the aggregate usage so the cost is transparent. Use ensembles for high-value, latency-tolerant prompts, not high-volume cheap traffic.

Can I run an ensemble with a single provider key?

Yes — that's a self-ensemble. Point the panel at the same model several times with different temperature/seed values. You get a diverse panel and a judge-reconciled answer without onboarding new vendors.

How is token usage and cost reported for an ensemble?

The client-facing usage object is the sum of every panel call plus the judge, so prompt_tokens and total_tokens reflect the full fan-out. The gateway also emits one usage event per sub-call (labeled panel or judge) for the per-request breakdown in your logs.

Is an ensemble model OpenAI-compatible?

Yes. You call it by its model name through the standard /v1/chat/completions API and get a standard chat completion back — streaming included. No SDK changes; the panel and judge are invisible to the caller.

How is an ensemble different from model routing or failover?

Routing and failover pick one target per request (a primary, or a fallback when the primary fails). An ensemble calls every panel member and synthesizes their answers into one. Routing optimizes for resilience and traffic shaping; ensembling optimizes for answer quality.

Get started

Model ensembles are part of AISIX AI Gateway. Define a panel and a judge, point your existing OpenAI-compatible client at the ensemble's model name, and compare the synthesized answer to your current single-model setup on your hardest prompts.