Why Your AI Gateway Strategy Matters
January 22, 2026
Recently, developers have been discussing Modal's "LLM Engineer's Almanac," which proposes a framework for classifying LLM workloads into three categories—offline, online, and semi-online. It resonated widely because it made explicit what many had felt intuitively: the flat per-token pricing of LLM APIs often hides the very different engineering demands behind each workload.
Yet one crucial aspect is often overlooked: the infrastructure layer sitting between your applications and LLM services. Whether you're running batch analytics, real-time chatbots, or bursty agent workflows, you need a unified control plane to manage authentication, rate limiting, failover, and observability.
That control plane is an AI Gateway. When designed correctly, it lets diverse workloads share scalable, reliable AI infrastructure instead of a fragile patchwork.
The Three Workload Types: A Quick Primer
Before diving into gateway strategies, let's establish the framework. Modal's categorization draws an analogy to the database world's OLTP/OLAP split:
| Workload Type | Characteristics | Primary Metric | Example Use Cases |
|---|---|---|---|
| Offline | Batch mode, asynchronous, writes to data stores | Throughput (tokens/second/dollar) | Dataset augmentation, bulk summarization, document processing |
| Online | Streaming mode, synchronous with humans | Latency (time to first token) | Chatbots, code completion, voice agents |
| Semi-Online | Bursty, communicates with other systems | Flexibility (scale-up speed) | AI agents, news analytics, document processing platforms |
The key insight is that each workload type demands different optimizations at every layer of the stack—from inference engines (vLLM vs. SGLang) to hardware (older GPUs vs. cutting-edge H100s) to, critically, the gateway layer.
Why Your AI Gateway Strategy Must Match Your Workload
Consider what happens when you route all three workload types through a single, undifferentiated gateway configuration:
```mermaid
flowchart TD
    subgraph "Problem: One-Size-Fits-All Gateway"
        A[Offline Batch Job<br/>1M tokens/hour]
        B[Online Chatbot<br/>100ms latency SLA]
        C[Bursty Agent<br/>10x traffic spike]
        A --> D
        B --> D
        C --> D
        D["GENERIC API GATEWAY<br/>• Fixed rate limits<br/>• No workload awareness<br/>• Single failover policy"]
        D --> E[LLM Provider]
        F["<b>Result:</b><br/>• Batch jobs hit rate limits, blocking chatbot traffic<br/>• Chatbot requests queued behind batch, violating latency SLA<br/>• Agent spikes overwhelm gateway, causing cascading failures"]
    end
```
The solution is workload-aware routing—configuring your AI Gateway to recognize and optimize for each workload type.
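In practice, "workload-aware" starts with giving each workload its own route. Here is a small, hypothetical client-side sketch (the gateway base URL and the helper function are assumptions); the path prefixes mirror the APISIX routes configured in the rest of this post.

```python
# Hypothetical client-side helper: pick a gateway route by workload type.
# The path prefixes mirror the APISIX routes configured later in this post.
from enum import Enum


class Workload(Enum):
    OFFLINE = "offline"          # batch, throughput-bound
    ONLINE = "online"            # human-facing, latency-bound
    SEMI_ONLINE = "semi_online"  # bursty, flexibility-bound


ROUTE_PREFIX = {
    Workload.OFFLINE: "/v1/batch/completions",
    Workload.ONLINE: "/v1/chat/completions",
    Workload.SEMI_ONLINE: "/v1/agent/completions",
}


def gateway_url(base: str, workload: Workload) -> str:
    """Build the request URL for a given workload type."""
    return f"{base.rstrip('/')}{ROUTE_PREFIX[workload]}"


# Example: gateway_url("http://gateway:9080", Workload.ONLINE)
```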
Workload-Specific Gateway Strategies
Strategy 1: Offline Workloads → Throughput-Optimized Routing
Offline workloads care about tokens per dollar, not latency. Your gateway should:
- Route to cost-optimized providers: Use weighted load balancing to prefer cheaper models or providers.
- Allow high burst limits: Batch jobs need to send large volumes without hitting rate limits.
- Enable async processing: Queue requests and process them when capacity is available (see the client-side sketch after this list).
- Log token usage aggressively: Track costs per job for chargeback and optimization.
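To illustrate the async-processing point, here is a minimal client-side sketch of a batch submitter that queues prompts and backs off when the gateway returns HTTP 429. The gateway address is an assumption, and the payload shape is the OpenAI-style chat format that the `/v1/batch/completions` route configured below proxies; adapt both to your deployment.

```python
# Sketch: async batch submitter that queues work and backs off on HTTP 429.
# Assumes a hypothetical gateway address and the /v1/batch/completions route.
import asyncio

import httpx

GATEWAY = "http://gateway:9080/v1/batch/completions"  # assumption: gateway address


async def worker(queue: asyncio.Queue, client: httpx.AsyncClient) -> None:
    while True:
        prompt = await queue.get()
        try:
            backoff = 1.0
            while True:
                resp = await client.post(
                    GATEWAY,
                    json={"messages": [{"role": "user", "content": prompt}]},
                    timeout=120,
                )
                if resp.status_code == 429:
                    # Quota exhausted: wait and retry instead of failing the job.
                    await asyncio.sleep(backoff)
                    backoff = min(backoff * 2, 60)
                    continue
                resp.raise_for_status()
                usage = resp.json().get("usage", {})
                print("tokens used:", usage.get("total_tokens"))
                break
        finally:
            queue.task_done()


async def run_batch(prompts: list[str], concurrency: int = 8) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for p in prompts:
        queue.put_nowait(p)
    async with httpx.AsyncClient() as client:
        workers = [asyncio.create_task(worker(queue, client)) for _ in range(concurrency)]
        await queue.join()          # wait until every queued prompt is processed
        for w in workers:
            w.cancel()
```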
Apache APISIX Configuration for Offline Workloads:
```yaml
# Route: /v1/batch/completions
# Optimized for throughput, cost-sensitive workloads
routes:
  - id: offline-batch-route
    uri: /v1/batch/completions
    methods: ["POST"]
    plugins:
      # Load balance across multiple LLM providers
      ai-proxy-multi:
        balancer:
          algorithm: roundrobin          # weighted round robin across providers
        fallback_strategy: ["http_429", "http_5xx"]
        instances:
          # DeepSeek primary (prefer cost efficiency)
          - name: deepseek-batch
            provider: deepseek
            weight: 7
            auth:
              header:
                Authorization: "Bearer ${{DEEPSEEK_API_KEY}}"
            options:
              model: deepseek-chat
              max_tokens: 4096
          # OpenAI secondary
          - name: openai-batch
            provider: openai
            weight: 3
            auth:
              header:
                Authorization: "Bearer ${{OPENAI_API_KEY}}"
            options:
              model: gpt-4o-mini
              max_tokens: 4096
        logging:
          summaries: true                # log token usage for cost tracking
      # Token-based rate limiting
      ai-rate-limiting:
        limit: 1000000                   # total token quota allowed in the interval
        time_window: 3600
        limit_strategy: total_tokens     # tokens counted from model responses
        show_limit_quota_header: true    # include AI quota headers
        rejected_code: 429
        rejected_msg: "Batch quota exceeded. Try later."
    upstream:
      type: roundrobin
      nodes:
        "dummy-service:8080": 1          # placeholder upstream; traffic is proxied by ai-proxy-multi
```
Key Configuration Choices:
| Setting | Value | Rationale |
|---|---|---|
| balancer.algorithm | roundrobin | Distribute load across cost-optimized providers by weight |
| weight ratio | 7:3 (DeepSeek : OpenAI) | Prefer the cheaper provider while maintaining a fallback |
| ai-rate-limiting.limit | 1,000,000 tokens/hour | High throughput ceiling for batch jobs |
| logging.summaries | true | Essential for cost tracking and optimization |
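To make the 7:3 weight ratio concrete, the snippet below computes a blended cost per million tokens. The prices are placeholder numbers for illustration only, not actual vendor pricing.

```python
# Illustrative blended-cost calculation for a 7:3 provider split.
# Prices are placeholders for illustration only, not actual vendor pricing.
def blended_cost_per_million(weights: dict[str, float], price_per_million: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    return sum(
        (w / total_weight) * price_per_million[name]
        for name, w in weights.items()
    )


weights = {"deepseek-batch": 7, "openai-batch": 3}
prices = {"deepseek-batch": 2.00, "openai-batch": 15.00}  # $/1M tokens, hypothetical

print(f"${blended_cost_per_million(weights, prices):.2f} per 1M tokens (blended)")
# With these placeholder prices: 0.7 * 2.00 + 0.3 * 15.00 = 5.90
```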
Strategy 2: Online Workloads → Latency-Optimized Routing
Online workloads serving human users have strict latency requirements—typically under 200ms to first token. Your gateway must minimize overhead and route to the fastest available provider.
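Before tuning the gateway, it helps to measure time to first token from the client's point of view. The sketch below is a rough measurement harness; the gateway address is hypothetical, and it assumes the route streams OpenAI-style server-sent events (as enabled by `stream: true` in the configuration that follows).

```python
# Sketch: measure time-to-first-token against the /v1/chat/completions route.
# Assumes the gateway streams OpenAI-style server-sent events and is reachable
# at a hypothetical address.
import time

import httpx

GATEWAY = "http://gateway:9080/v1/chat/completions"  # assumption: gateway address


def time_to_first_token(prompt: str) -> float:
    start = time.perf_counter()
    with httpx.stream(
        "POST",
        GATEWAY,
        json={"messages": [{"role": "user", "content": prompt}], "stream": True},
        timeout=30,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line.startswith("data:") and line != "data: [DONE]":
                return time.perf_counter() - start  # first streamed chunk arrived
    raise RuntimeError("no streamed tokens received")


# print(f"TTFT: {time_to_first_token('hello') * 1000:.0f} ms")
```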
Apache APISIX Configuration for Online Workloads:
```yaml
# Route: /v1/chat/completions
# Optimized for latency, human-facing
routes:
  - id: online-chat-route
    uri: /v1/chat/completions
    plugins:
      ai-proxy-multi:
        balancer:
          algorithm: chash
          hash_on: consumer              # Sticky sessions for multi-turn conversations
        instances:
          # Primary: Fastest provider with priority
          - name: openai-fast
            provider: openai
            priority: 10
            weight: 1
            auth:
              header:
                Authorization: "Bearer ${{OPENAI_API_KEY}}"
            options:
              model: gpt-4o
              stream: true               # Enable streaming for perceived latency
          # Secondary: Fallback with lower priority
          - name: anthropic-fallback
            provider: anthropic
            priority: 5
            weight: 1
            auth:
              header:
                x-api-key: "${{ANTHROPIC_API_KEY}}"
            options:
              model: claude-3-5-sonnet-20241022
              stream: true
        fallback_strategy: instance_health_and_rate_limiting
        checks:
          active:
            type: https
            timeout: 2
            http_path: /health
            healthy:
              interval: 5
              successes: 2
            unhealthy:
              interval: 2
              http_failures: 2
      # Strict per-user rate limits to ensure fair access
      ai-rate-limiting:
        limit: 10000                     # token quota per interval
        time_window: 60                  # 60s window
        limit_strategy: total_tokens
        show_limit_quota_header: true
        rejected_code: 429
        rejected_msg: "Rate limit exceeded. Please wait."
```
Key Configuration Choices:
| Setting | Value | Rationale |
|---|---|---|
| balancer.algorithm | chash | Consistent hashing for session stickiness |
| hash_on | consumer | Route same user to same instance for KV cache efficiency |
| priority | 10 vs. 5 | Prefer fastest provider, failover only when unhealthy |
| stream | true | Reduce perceived latency with token streaming |
| checks.active | Enabled | Proactive health checks to avoid routing to degraded providers |
Why Session Stickiness Matters:
Modal's analysis highlights that online workloads benefit from prefix-aware routing. When a user has a multi-turn conversation, the LLM's KV cache can be reused if requests are routed to the same inference replica. At the gateway level, consistent hashing (chash) achieves a similar effect by routing the same consumer to the same upstream instance.
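For intuition, here is a stripped-down sketch of consumer-based consistent hashing. APISIX's `chash` balancer uses its own weighted hash-ring implementation, so treat this only as an illustration of the pinning behavior that makes KV-cache reuse possible.

```python
# Simplified illustration of consumer-based consistent hashing. APISIX's chash
# balancer uses a weighted hash ring internally; this sketch only demonstrates
# the pinning behavior: the same consumer always maps to the same instance.
import bisect
import hashlib

INSTANCES = ["openai-fast", "anthropic-fallback"]


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


# Build a hash ring with virtual nodes for a smoother distribution.
RING = sorted((_hash(f"{inst}#{i}"), inst) for inst in INSTANCES for i in range(100))
KEYS = [h for h, _ in RING]


def pick_instance(consumer_id: str) -> str:
    idx = bisect.bisect(KEYS, _hash(consumer_id)) % len(RING)
    return RING[idx][1]


# The same consumer lands on the same instance across requests,
# so multi-turn conversations can reuse the upstream KV cache.
assert pick_instance("user-42") == pick_instance("user-42")
```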
Strategy 3: Semi-Online Workloads → Flexibility-Optimized Routing
Semi-online workloads—like AI agents processing document uploads or news analytics systems responding to breaking events—have high peak-to-average load ratios. Your gateway must handle sudden traffic spikes without degrading service.
Apache APISIX Configuration for Semi-Online Workloads:
```yaml
# Route: /v1/agent/completions
# Optimized for flexibility, handles traffic spikes
routes:
  - id: semi-online-agent-route
    uri: /v1/agent/completions
    plugins:
      ai-proxy-multi:
        balancer:
          algorithm: roundrobin
        instances:
          # Multiple providers for capacity headroom
          - name: openai-primary
            provider: openai
            weight: 4
            auth:
              header:
                Authorization: "Bearer ${{OPENAI_API_KEY}}"
            options:
              model: gpt-4o
          - name: deepseek-secondary
            provider: deepseek
            weight: 3
            auth:
              header:
                Authorization: "Bearer ${{DEEPSEEK_API_KEY}}"
            options:
              model: deepseek-chat
          - name: anthropic-tertiary
            provider: anthropic
            weight: 3
            auth:
              header:
                x-api-key: "${{ANTHROPIC_API_KEY}}"
            options:
              model: claude-3-5-sonnet-20241022
        # Aggressive fallback for spike handling
        fallback_strategy: ["rate_limiting", "http_429", "http_5xx"]
        logging:
          summaries: true
          payloads: false                # Don't log payloads for agent workflows (privacy)
      # Burst-tolerant rate limiting
      ai-rate-limiting:
        limit: 100000                    # 100K tokens per minute
        time_window: 60
        limit_strategy: total_tokens
        show_limit_quota_header: true
        rejected_code: 429
        rejected_msg: "System at capacity. Retrying automatically."
    # Retry configuration for resilience (retries are upstream attributes,
    # not a proxy-rewrite setting)
    upstream:
      type: roundrobin
      retries: 3
      retry_timeout: 10
      nodes:
        "dummy-service:8080": 1          # placeholder upstream; traffic is proxied by ai-proxy-multi
```
Key Configuration Choices:
| Setting | Value | Rationale |
|---|---|---|
| instances count | 3 providers | Distribute load across multiple providers for capacity |
| fallback_strategy | ["rate_limiting", "http_429", "http_5xx"] | Aggressive failover on any capacity signal |
| ai-rate-limiting.limit | 100,000 tokens/minute | Generous ceiling so short spikes are absorbed rather than rejected |
| upstream.retries | 3 | Automatic retry on transient failures |
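On the client side, agent workflows should treat 429s during spikes as a signal to back off rather than fail. Here is a minimal sketch against the route above; the gateway address and the quota-reset header name are assumptions, so check what your gateway actually emits when `show_limit_quota_header` is enabled.

```python
# Sketch: spike-tolerant retry logic for the agent route, using jittered
# exponential backoff on 429/5xx. The quota header name below is an
# assumption; verify the headers your gateway actually returns.
import random
import time

import httpx

GATEWAY = "http://gateway:9080/v1/agent/completions"  # assumption: gateway address


def call_with_backoff(payload: dict, max_attempts: int = 5) -> dict:
    delay = 0.5
    for _ in range(max_attempts):
        resp = httpx.post(GATEWAY, json=payload, timeout=60)
        if resp.status_code < 400:
            return resp.json()
        if resp.status_code == 429 or resp.status_code >= 500:
            # Prefer the gateway's hint if it exposes one (header name assumed).
            reset = resp.headers.get("X-AI-RateLimit-Reset")
            wait = float(reset) if reset else delay + random.uniform(0, delay)
            time.sleep(min(wait, 30))
            delay = min(delay * 2, 30)
            continue
        resp.raise_for_status()  # other 4xx errors are not retryable
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```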
Complete Architecture: Workload-Aware AI Gateway
Here's how all three strategies come together in a unified architecture:
```mermaid
flowchart TD
    subgraph "Workload-Aware AI Gateway Architecture"
        A[Batch Service]
        B[Chatbot Frontend]
        C[Agent Workflow]
        A -->|/v1/batch/*| G
        B -->|/v1/chat/*| G
        C -->|/v1/agent/*| G
        subgraph G[APACHE APISIX AI GATEWAY]
            direction TB
            H["ROUTE MATCHING<br/>/v1/batch/* → Offline Config<br/>/v1/chat/* → Online Config<br/>/v1/agent/* → Semi-Online Config"]
            subgraph "Workload-Specific Configs"
                direction LR
                I["OFFLINE<br/>• High quota<br/>• Cost-opt LB<br/>• Async queue"]
                J["ONLINE<br/>• Low latency<br/>• Sticky sess<br/>• Health chk"]
                K["SEMI-ONLINE<br/>• Burst-ready<br/>• Multi-prov<br/>• Auto-retry"]
            end
            L["SHARED CAPABILITIES<br/>• Authentication<br/>• Token usage logging<br/>• Prompt guard<br/>• Cost allocation"]
            H --> I
            H --> J
            H --> K
            I --> L
            J --> L
            K --> L
        end
        G --> M[DeepSeek<br/>Cost-opt]
        G --> N[OpenAI<br/>Fast]
        G --> O[Anthropic<br/>Fallback]
    end
```
Results: Before and After
| Metric | Before (Generic Gateway) | After (Workload-Aware) | Improvement |
|---|---|---|---|
| Batch throughput | 50K tokens/hour | 800K tokens/hour | 16x |
| Chat P99 latency | 2,500ms | 180ms | 14x |
| Agent spike handling | 2x baseline | 10x baseline | 5x |
| Cost per 1M tokens | $15 (OpenAI only) | $8 (blended) | 47% savings |
| Failover time | Manual intervention | <5 seconds automatic | Fully automated |
Key Takeaways
The Modal framework for LLM workloads provides a powerful lens for understanding your AI infrastructure needs. But understanding workloads is only half the battle—you also need infrastructure that can act on that understanding.
Three principles for workload-aware AI gateways:
1. Route by intent, not just path. Different workloads need different configurations, even if they hit the same underlying LLM.
2. Match rate limits to workload characteristics. Batch jobs need high ceilings; chat needs per-user fairness; agents need burst tolerance.
3. Use multi-provider routing strategically. Cost-optimize for batch, latency-optimize for chat, capacity-optimize for agents.
Conclusion: The Gateway is the Control Plane
As Modal's analysis notes, "the era of model API dominance is ending." Open-source models and inference engines are eroding the advantages of proprietary APIs. But this shift creates a new challenge: managing a heterogeneous landscape of LLM providers, each with different pricing, performance, and reliability characteristics.
An AI Gateway is the answer. It provides a unified control plane where you can implement workload-aware routing, enforce cost controls, ensure compliance, and maintain observability—regardless of which providers you use or how your workload mix evolves.
The developers who build this infrastructure now will be the ones who scale their AI applications successfully. Those who treat all LLM traffic as identical will keep hitting walls.