Why Your AI Gateway Strategy Matters
January 22, 2026
Recently, developers have been discussing Modal's "LLM Engineer's Almanac," which proposes a framework for classifying LLM workloads into three categories—offline, online, and semi-online. It resonated widely because it made explicit what many had felt intuitively: the flat per-token pricing of LLM APIs often hides the very different engineering demands behind each workload.
Yet one crucial aspect is often overlooked: the infrastructure layer sitting between your applications and LLM services. Whether you're running batch analytics, real-time chatbots, or bursty agent workflows, you need a unified control plane to manage authentication, rate limiting, failover, and observability.
That control plane is an AI Gateway. When designed correctly, it lets diverse workloads share scalable, reliable AI infrastructure instead of a fragile patchwork.
The Three Workload Types: A Quick Primer
Before diving into gateway strategies, let's establish the framework. Modal's categorization draws an analogy to the database world's OLTP/OLAP split:
| Workload Type | Characteristics | Primary Metric | Example Use Cases |
|---|---|---|---|
| Offline | Batch mode, asynchronous, writes to data stores | Throughput (tokens/second/dollar) | Dataset augmentation, bulk summarization, document processing |
| Online | Streaming mode, synchronous with humans | Latency (time to first token) | Chatbots, code completion, voice agents |
| Semi-Online | Bursty, communicates with other systems | Flexibility (scale-up speed) | AI agents, news analytics, document processing platforms |
The key insight is that each workload type demands different optimizations at every layer of the stack—from inference engines (vLLM vs. SGLang) to hardware (older GPUs vs. cutting-edge H100s) to, critically, the gateway layer.
Why Your AI Gateway Strategy Must Match Your Workload
Consider what happens when you route all three workload types through a single, undifferentiated gateway configuration:
```mermaid
flowchart TD
    subgraph "Problem: One-Size-Fits-All Gateway"
        A[Offline Batch Job<br/>1M tokens/hour]
        B[Online Chatbot<br/>100ms latency SLA]
        C[Bursty Agent<br/>10x traffic spike]
        A --> D
        B --> D
        C --> D
        D["GENERIC API GATEWAY<br/>• Fixed rate limits<br/>• No workload awareness<br/>• Single failover policy"]
        D --> E[LLM Provider]
        F["<b>Result:</b><br/>• Batch jobs hit rate limits, blocking chatbot traffic<br/>• Chatbot requests queued behind batch, violating latency SLA<br/>• Agent spikes overwhelm gateway, causing cascading failures"]
    end
```
The solution is workload-aware routing—configuring your AI Gateway to recognize and optimize for each workload type.
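In practice, "workload-aware" starts with giving each workload its own route. Here is a small, hypothetical client-side sketch (the gateway base URL and the helper function are assumptions); the path prefixes mirror the APISIX routes configured in the rest of this post.

```python
# Hypothetical client-side helper: pick a gateway route by workload type.
# The path prefixes mirror the APISIX routes configured later in this post.
from enum import Enum


class Workload(Enum):
    OFFLINE = "offline"          # batch, throughput-bound
    ONLINE = "online"            # human-facing, latency-bound
    SEMI_ONLINE = "semi_online"  # bursty, flexibility-bound


ROUTE_PREFIX = {
    Workload.OFFLINE: "/v1/batch/completions",
    Workload.ONLINE: "/v1/chat/completions",
    Workload.SEMI_ONLINE: "/v1/agent/completions",
}


def gateway_url(base: str, workload: Workload) -> str:
    """Build the request URL for a given workload type."""
    return f"{base.rstrip('/')}{ROUTE_PREFIX[workload]}"


# Example: gateway_url("http://gateway:9080", Workload.ONLINE)
```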
Workload-Specific Gateway Strategies
Strategy 1: Offline Workloads → Throughput-Optimized Routing
Offline workloads care about tokens per dollar, not latency. Your gateway should:
- Route to cost-optimized providers: Use weighted load balancing to prefer cheaper models or providers.
- Allow high burst limits: Batch jobs need to send large volumes without hitting rate limits.
- Enable async processing: Queue requests and process them when capacity is available (see the client-side sketch after this list).
- Log token usage aggressively: Track costs per job for chargeback and optimization.
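To illustrate the async-processing point, here is a minimal client-side sketch of a batch submitter that queues prompts and backs off when the gateway returns HTTP 429. The gateway address is an assumption, and the payload shape is the OpenAI-style chat format that the `/v1/batch/completions` route configured below proxies; adapt both to your deployment.

```python
# Sketch: async batch submitter that queues work and backs off on HTTP 429.
# Assumes a hypothetical gateway address and the /v1/batch/completions route.
import asyncio

import httpx

GATEWAY = "http://gateway:9080/v1/batch/completions"  # assumption: gateway address


async def worker(queue: asyncio.Queue, client: httpx.AsyncClient) -> None:
    while True:
        prompt = await queue.get()
        try:
            backoff = 1.0
            while True:
                resp = await client.post(
                    GATEWAY,
                    json={"messages": [{"role": "user", "content": prompt}]},
                    timeout=120,
                )
                if resp.status_code == 429:
                    # Quota exhausted: wait and retry instead of failing the job.
                    await asyncio.sleep(backoff)
                    backoff = min(backoff * 2, 60)
                    continue
                resp.raise_for_status()
                usage = resp.json().get("usage", {})
                print("tokens used:", usage.get("total_tokens"))
                break
        finally:
            queue.task_done()


async def run_batch(prompts: list[str], concurrency: int = 8) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for p in prompts:
        queue.put_nowait(p)
    async with httpx.AsyncClient() as client:
        workers = [asyncio.create_task(worker(queue, client)) for _ in range(concurrency)]
        await queue.join()          # wait until every queued prompt is processed
        for w in workers:
            w.cancel()
```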
Apache APISIX Configuration for Offline Workloads:
```yaml
# Route: /v1/batch/completions
# Optimized for throughput, cost-sensitive workloads
routes:
  - id: offline-batch-route
    uri: /v1/batch/completions
    methods: ["POST"]
    plugins:
      # Load balance across multiple LLM providers
      ai-proxy-multi:
        balancer:
          algorithm: roundrobin          # weighted round robin across providers
        fallback_strategy: ["http_429", "http_5xx"]
        instances:
          # DeepSeek primary (prefer cost efficiency)
          - name: deepseek-batch
            provider: deepseek
            weight: 7
            auth:
              header:
                Authorization: "Bearer ${{DEEPSEEK_API_KEY}}"
            options:
              model: deepseek-chat
              max_tokens: 4096
          # OpenAI secondary
          - name: openai-batch
            provider: openai
            weight: 3
            auth:
              header:
                Authorization: "Bearer ${{OPENAI_API_KEY}}"
            options:
              model: gpt-4o-mini
              max_tokens: 4096
        logging:
          summaries: true                # log token usage for cost tracking
      # Token-based rate limiting
      ai-rate-limiting:
        limit: 1000000                   # total token quota allowed in the interval
        time_window: 3600
        limit_strategy: total_tokens     # tokens counted from model responses
        show_limit_quota_header: true    # include AI quota headers
        rejected_code: 429
        rejected_msg: "Batch quota exceeded. Try later."
    upstream:
      type: roundrobin
      nodes:
        "dummy-service:8080": 1          # placeholder upstream; traffic is proxied by ai-proxy-multi
```
Key Configuration Choices:
| Setting | Value | Rationale |
|---|---|---|
| balancer.algorithm | roundrobin | Distribute load across cost-optimized providers by weight |
| weight ratio | 7:3 (DeepSeek : OpenAI) | Prefer the cheaper provider while maintaining a fallback |
| ai-rate-limiting.limit | 1,000,000 tokens/hour | High throughput ceiling for batch jobs |
| logging.summaries | true | Essential for cost tracking and optimization |
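To make the 7:3 weight ratio concrete, the snippet below computes a blended cost per million tokens. The prices are placeholder numbers for illustration only, not actual vendor pricing.

```python
# Illustrative blended-cost calculation for a 7:3 provider split.
# Prices are placeholders for illustration only, not actual vendor pricing.
def blended_cost_per_million(weights: dict[str, float], price_per_million: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    return sum(
        (w / total_weight) * price_per_million[name]
        for name, w in weights.items()
    )


weights = {"deepseek-batch": 7, "openai-batch": 3}
prices = {"deepseek-batch": 2.00, "openai-batch": 15.00}  # $/1M tokens, hypothetical

print(f"${blended_cost_per_million(weights, prices):.2f} per 1M tokens (blended)")
# With these placeholder prices: 0.7 * 2.00 + 0.3 * 15.00 = 5.90
```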
Strategy 2: Online Workloads → Latency-Optimized Routing
Online workloads serving human users have strict latency requirements—typically under 200ms to first token. Your gateway must minimize overhead and route to the fastest available provider.
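Before tuning the gateway, it helps to measure time to first token from the client's point of view. The sketch below is a rough measurement harness; the gateway address is hypothetical, and it assumes the route streams OpenAI-style server-sent events (as enabled by `stream: true` in the configuration that follows).

```python
# Sketch: measure time-to-first-token against the /v1/chat/completions route.
# Assumes the gateway streams OpenAI-style server-sent events and is reachable
# at a hypothetical address.
import time

import httpx

GATEWAY = "http://gateway:9080/v1/chat/completions"  # assumption: gateway address


def time_to_first_token(prompt: str) -> float:
    start = time.perf_counter()
    with httpx.stream(
        "POST",
        GATEWAY,
        json={"messages": [{"role": "user", "content": prompt}], "stream": True},
        timeout=30,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line.startswith("data:") and line != "data: [DONE]":
                return time.perf_counter() - start  # first streamed chunk arrived
    raise RuntimeError("no streamed tokens received")


# print(f"TTFT: {time_to_first_token('hello') * 1000:.0f} ms")
```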
Apache APISIX Configuration for Online Workloads:
```yaml
# Route: /v1/chat/completions
# Optimized for latency, human-facing
routes:
  - id: online-chat-route
    uri: /v1/chat/completions
    plugins:
      ai-proxy-multi:
        balancer:
          algorithm: chash
          hash_on: consumer              # Sticky sessions for multi-turn conversations
        instances:
          # Primary: Fastest provider with priority
          - name: openai-fast
            provider: openai
            priority: 10
            weight: 1
            auth:
              header:
                Authorization: "Bearer ${{OPENAI_API_KEY}}"
            options:
              model: gpt-4o
              stream: true               # Enable streaming for perceived latency
          # Secondary: Fallback with lower priority
          - name: anthropic-fallback
            provider: anthropic
            priority: 5
            weight: 1
            auth:
              header:
                x-api-key: "${{ANTHROPIC_API_KEY}}"
            options:
              model: claude-3-5-sonnet-20241022
              stream: true
        fallback_strategy: instance_health_and_rate_limiting
        checks:
          active:
            type: https
            timeout: 2
            http_path: /health
            healthy:
              interval: 5
              successes: 2
            unhealthy:
              interval: 2
              http_failures: 2
      # Strict per-user rate limits to ensure fair access
      ai-rate-limiting:
        limit: 10000                     # token quota per interval
        time_window: 60                  # 60s window
        limit_strategy: total_tokens
        show_limit_quota_header: true
        rejected_code: 429
        rejected_msg: "Rate limit exceeded. Please wait."
```
Key Configuration Choices:
| Setting | Value | Rationale |
|---|---|---|
| balancer.algorithm | chash | Consistent hashing for session stickiness |
| hash_on | consumer | Route same user to same instance for KV cache efficiency |
| priority | 10 vs. 5 | Prefer fastest provider, failover only when unhealthy |
| stream | true | Reduce perceived latency with token streaming |
| checks.active | Enabled | Proactive health checks to avoid routing to degraded providers |
Why Session Stickiness Matters:
Modal's analysis highlights that online workloads benefit from prefix-aware routing. When a user has a multi-turn conversation, the LLM's KV cache can be reused if requests are routed to the same inference replica. At the gateway level, consistent hashing (chash) achieves a similar effect by routing the same consumer to the same upstream instance.
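For intuition, here is a stripped-down sketch of consumer-based consistent hashing. APISIX's `chash` balancer uses its own weighted hash-ring implementation, so treat this only as an illustration of the pinning behavior that makes KV-cache reuse possible.

```python
# Simplified illustration of consumer-based consistent hashing. APISIX's chash
# balancer uses a weighted hash ring internally; this sketch only demonstrates
# the pinning behavior: the same consumer always maps to the same instance.
import bisect
import hashlib

INSTANCES = ["openai-fast", "anthropic-fallback"]


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


# Build a hash ring with virtual nodes for a smoother distribution.
RING = sorted((_hash(f"{inst}#{i}"), inst) for inst in INSTANCES for i in range(100))
KEYS = [h for h, _ in RING]


def pick_instance(consumer_id: str) -> str:
    idx = bisect.bisect(KEYS, _hash(consumer_id)) % len(RING)
    return RING[idx][1]


# The same consumer lands on the same instance across requests,
# so multi-turn conversations can reuse the upstream KV cache.
assert pick_instance("user-42") == pick_instance("user-42")
```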
Strategy 3: Semi-Online Workloads → Flexibility-Optimized Routing
Semi-online workloads—like AI agents processing document uploads or news analytics systems responding to breaking events—have high peak-to-average load ratios. Your gateway must handle sudden traffic spikes without degrading service.
Apache APISIX Configuration for Semi-Online Workloads:
```yaml
# Route: /v1/agent/completions
# Optimized for flexibility, handles traffic spikes
routes:
  - id: semi-online-agent-route
    uri: /v1/agent/completions
    plugins:
      ai-proxy-multi:
        balancer:
          algorithm: roundrobin
        instances:
          # Multiple providers for capacity headroom
          - name: openai-primary
            provider: openai
            weight: 4
            auth:
              header:
                Authorization: "Bearer ${{OPENAI_API_KEY}}"
            options:
              model: gpt-4o
          - name: deepseek-secondary
            provider: deepseek
            weight: 3
            auth:
              header:
                Authorization: "Bearer ${{DEEPSEEK_API_KEY}}"
            options:
              model: deepseek-chat
          - name: anthropic-tertiary
            provider: anthropic
            weight: 3
            auth:
              header:
                x-api-key: "${{ANTHROPIC_API_KEY}}"
            options:
              model: claude-3-5-sonnet-20241022
        # Aggressive fallback for spike handling
        fallback_strategy: ["rate_limiting", "http_429", "http_5xx"]
        logging:
          summaries: true
          payloads: false                # Don't log payloads for agent workflows (privacy)
      # Burst-tolerant rate limiting
      ai-rate-limiting:
        limit: 100000                    # 100K tokens per minute
        time_window: 60
        limit_strategy: total_tokens
        show_limit_quota_header: true
        rejected_code: 429
        rejected_msg: "System at capacity. Retrying automatically."
    # Retry configuration for resilience (retries are upstream attributes,
    # not a proxy-rewrite setting)
    upstream:
      type: roundrobin
      retries: 3
      retry_timeout: 10
      nodes:
        "dummy-service:8080": 1          # placeholder upstream; traffic is proxied by ai-proxy-multi
```
Key Configuration Choices:
| Setting | Value | Rationale |
|---|---|---|
| instances count | 3 providers | Distribute load across multiple providers for capacity |
| fallback_strategy | ["rate_limiting", "http_429", "http_5xx"] | Aggressive failover on any capacity signal |
| ai-rate-limiting.limit | 100,000 tokens/minute | Generous ceiling so short spikes are absorbed rather than rejected |
| upstream.retries | 3 | Automatic retry on transient failures |
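On the client side, agent workflows should treat 429s during spikes as a signal to back off rather than fail. Here is a minimal sketch against the route above; the gateway address and the quota-reset header name are assumptions, so check what your gateway actually emits when `show_limit_quota_header` is enabled.

```python
# Sketch: spike-tolerant retry logic for the agent route, using jittered
# exponential backoff on 429/5xx. The quota header name below is an
# assumption; verify the headers your gateway actually returns.
import random
import time

import httpx

GATEWAY = "http://gateway:9080/v1/agent/completions"  # assumption: gateway address


def call_with_backoff(payload: dict, max_attempts: int = 5) -> dict:
    delay = 0.5
    for _ in range(max_attempts):
        resp = httpx.post(GATEWAY, json=payload, timeout=60)
        if resp.status_code < 400:
            return resp.json()
        if resp.status_code == 429 or resp.status_code >= 500:
            # Prefer the gateway's hint if it exposes one (header name assumed).
            reset = resp.headers.get("X-AI-RateLimit-Reset")
            wait = float(reset) if reset else delay + random.uniform(0, delay)
            time.sleep(min(wait, 30))
            delay = min(delay * 2, 30)
            continue
        resp.raise_for_status()  # other 4xx errors are not retryable
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```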
Complete Architecture: Workload-Aware AI Gateway
Here's how all three strategies come together in a unified architecture:
```mermaid
flowchart TD
    subgraph "Workload-Aware AI Gateway Architecture"
        A[Batch Service]
        B[Chatbot Frontend]
        C[Agent Workflow]
        A -->|/v1/batch/*| G
        B -->|/v1/chat/*| G
        C -->|/v1/agent/*| G
        subgraph G[APACHE APISIX AI GATEWAY]
            direction TB
            H["ROUTE MATCHING<br/>/v1/batch/* → Offline Config<br/>/v1/chat/* → Online Config<br/>/v1/agent/* → Semi-Online Config"]
            subgraph "Workload-Specific Configs"
                direction LR
                I["OFFLINE<br/>• High quota<br/>• Cost-opt LB<br/>• Async queue"]
                J["ONLINE<br/>• Low latency<br/>• Sticky sess<br/>• Health chk"]
                K["SEMI-ONLINE<br/>• Burst-ready<br/>• Multi-prov<br/>• Auto-retry"]
            end
            L["SHARED CAPABILITIES<br/>• Authentication<br/>• Token usage logging<br/>• Prompt guard<br/>• Cost allocation"]
            H --> I
            H --> J
            H --> K
            I --> L
            J --> L
            K --> L
        end
        G --> M[DeepSeek<br/>Cost-opt]
        G --> N[OpenAI<br/>Fast]
        G --> O[Anthropic<br/>Fallback]
    end
```
Results: Before and After
| Metric | Before (Generic Gateway) | After (Workload-Aware) | Improvement |
|---|---|---|---|
| Batch throughput | 50K tokens/hour | 800K tokens/hour | 16x |
| Chat P99 latency | 2,500ms | 180ms | 14x |
| Agent spike handling | 2x baseline | 10x baseline | 5x |
| Cost per 1M tokens | $15 (OpenAI only) | $8 (blended) | 47% savings |
| Failover time | Manual intervention | <5 seconds automatic | Fully automated |
Key Takeaways
The Modal framework for LLM workloads provides a powerful lens for understanding your AI infrastructure needs. But understanding workloads is only half the battle—you also need infrastructure that can act on that understanding.
Three principles for workload-aware AI gateways:
1. Route by intent, not just path. Different workloads need different configurations, even if they hit the same underlying LLM.
2. Match rate limits to workload characteristics. Batch jobs need high ceilings; chat needs per-user fairness; agents need burst tolerance.
3. Use multi-provider routing strategically. Cost-optimize for batch, latency-optimize for chat, capacity-optimize for agents.
Conclusion: The Gateway is the Control Plane
As Modal's analysis notes, "the era of model API dominance is ending." Open-source models and inference engines are eroding the advantages of proprietary APIs. But this shift creates a new challenge: managing a heterogeneous landscape of LLM providers, each with different pricing, performance, and reliability characteristics.
An AI Gateway is the answer. It provides a unified control plane where you can implement workload-aware routing, enforce cost controls, ensure compliance, and maintain observability—regardless of which providers you use or how your workload mix evolves.
The developers who build this infrastructure now will be the ones who scale their AI applications successfully. Those who treat all LLM traffic as identical will keep hitting walls.