How API Gateways Proxy LLM Requests: Architecture, Best Practices, and Real-World Examples
API7.ai
June 11, 2025
Introduction
Large Language Models (LLMs), such as OpenAI's GPT-4 and DeepSeek's models, are increasingly used in applications ranging from AI assistants to code generation platforms. However, LLM APIs introduce unique challenges: they are rate-limited, compute-intensive, and occasionally unstable. API gateways can help software teams manage, secure, and optimize LLM requests in production systems.
This article explains how API gateways proxy LLM requests, including architectural design, plugin integration, flow control mechanisms, and real-world use cases. We will also explore how Apache APISIX can be configured to handle LLM workloads effectively.
Core Concepts and Challenges of Proxying LLM APIs
LLM-Specific API Characteristics
- High latency: LLM responses may take seconds to return.
- Token-based billing: Most LLM providers charge based on token usage.
- Rate limits: Strict per-minute or per-second rate limits.
- Transient errors: Occasional failures under heavy load make retry strategies necessary.
- Streaming responses: SSE (Server-Sent Events) or chunked responses.
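To make the streaming behavior concrete, here is a minimal sketch of a streaming chat completion request against an OpenAI-compatible endpoint; the prompt is illustrative, and "stream": true asks the provider for SSE-style chunked output.

# Minimal streaming request against an OpenAI-compatible chat endpoint (illustrative prompt).
# -N disables curl's output buffering so SSE chunks are printed as they arrive.
curl -N "https://api.openai.com/v1/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "stream": true,
    "messages": [{ "role": "user", "content": "Summarize SSE in one sentence." }]
  }'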
Challenges in Proxying LLM Requests
- Managing retries with exponential backoff (a minimal client-side sketch follows this list).
- Enforcing request/response timeout thresholds.
- Handling SSE and preserving response streaming.
- Differentiating between upstream APIs (OpenAI vs. DeepSeek).
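As a minimal client-side illustration of the first challenge above, the following shell sketch retries a request on 429 or 5xx responses, doubling the wait after each attempt; the endpoint and payload are placeholders.

# Retry-with-exponential-backoff sketch (illustrative endpoint and payload).
delay=1
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -w "%{http_code}" \
    "http://127.0.0.1:9080/anything" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}]}')
  # Stop on success or on a non-retryable status; retry only on 429 and 5xx.
  if [ "$status" -lt 429 ]; then
    echo "Done with status $status"
    break
  fi
  echo "Attempt $attempt returned $status; retrying in ${delay}s"
  sleep "$delay"
  delay=$((delay * 2))
done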
Architectural Overview: API Gateway as LLM Proxy
Key Responsibilities of API Gateway for LLM
- Traffic routing based on model or provider.
- Authentication management (e.g., API keys for OpenAI, DeepSeek).
- Rate limiting per client or tenant (see the configuration sketch after this list).
- Caching for non-dynamic prompts.
- Observability: Logging, tracing, metrics.
- Failover and fallback mechanisms.
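As a sketch of per-client identification and rate limiting, the following commands use APISIX's standard key-auth and limit-count plugins to create a consumer and cap it at 100 requests per minute; the consumer name, key, limits, and the route id being patched are illustrative assumptions.

# Create a consumer identified by an API key (illustrative name and key).
curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "username": "tenant_a",
    "plugins": {
      "key-auth": { "key": "tenant-a-secret-key" }
    }
  }'

# Attach key-auth and a per-consumer request limit to an existing LLM route
# (assumes a route with id "llm-route" has already been created).
curl "http://127.0.0.1:9180/apisix/admin/routes/llm-route" -X PATCH \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "plugins": {
      "key-auth": {},
      "limit-count": {
        "count": 100,
        "time_window": 60,
        "key_type": "var",
        "key": "consumer_name",
        "rejected_code": 429
      }
    }
  }'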
Request Lifecycle Through API Gateway
sequenceDiagram
    participant Client
    participant API Gateway
    participant OpenAI
    Client->>API Gateway: POST /v1/chat/completions
    API Gateway->>API Gateway: Validate key, apply rate limiting
    API Gateway->>OpenAI: Forward request with upstream credentials
    OpenAI-->>API Gateway: LLM response (possibly streaming)
    API Gateway-->>Client: Stream response
Using Apache APISIX to Proxy LLM Requests
Apache APISIX provides rich plugin capabilities suitable for LLM workloads.
Key Plugins
- ai-proxy: The ai-proxy plugin simplifies access to LLM and embedding models by transforming plugin configurations into the designated request format. It supports integration with OpenAI, DeepSeek, and other OpenAI-compatible APIs.
- ai-rate-limiting: The ai-rate-limiting plugin enforces token-based rate limiting for requests sent to LLM services. It helps manage API usage by controlling the number of tokens consumed within a specified time frame, ensuring fair resource allocation and preventing excessive load on the service. It is often used together with the ai-proxy-multi plugin.
- ai-request-rewrite: The ai-request-rewrite plugin processes client requests by forwarding them to LLM services for transformation before relaying them to upstream services. This enables LLM-powered modifications such as data redaction, content enrichment, or reformatting (a configuration sketch follows this list). The plugin supports integration with OpenAI, DeepSeek, and other OpenAI-compatible APIs.
- ai-aws-content-moderation: The ai-aws-content-moderation plugin integrates with AWS Comprehend to check request bodies for toxicity, such as profanity, hate speech, insults, harassment, and violence, when proxying to LLMs, rejecting requests whose evaluated score exceeds the configured threshold.
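As a rough sketch of the ai-request-rewrite use case, the configuration below asks an LLM to redact personal data from request bodies before they reach the backend; the attribute names (notably prompt) and the httpbin upstream are our assumptions and should be verified against the current plugin reference.

# Illustrative ai-request-rewrite configuration; verify attribute names in the plugin docs.
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-request-rewrite-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-request-rewrite": {
        "provider": "openai",
        "auth": {
          "header": {
            "Authorization": "Bearer '"$OPENAI_API_KEY"'"
          }
        },
        "options": { "model": "gpt-4" },
        "prompt": "Redact any personal data (names, emails, phone numbers) from the request body and return only the redacted body."
      }
    },
    "upstream": {
      "type": "roundrobin",
      "nodes": { "httpbin.org:80": 1 }
    }
  }'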
Example: Proxy to OpenAI
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${ADMIN_API_KEY}" \ -d '{ "id": "ai-proxy-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy": { "provider": "openai", "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options":{ "model": "gpt-4" } } } }'
Failover Between OpenAI and DeepSeek
sequenceDiagram
    participant Client
    participant API Gateway
    participant OpenAI
    participant DeepSeek
    Client->>API Gateway: POST /v1/chat/completions
    API Gateway->>OpenAI: Primary upstream call
    OpenAI-->>API Gateway: Error (e.g. 429)
    API Gateway->>DeepSeek: Retry as fallback
    DeepSeek-->>API Gateway: Success
    API Gateway-->>Client: Return DeepSeek response
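A configuration sketch for this failover flow uses the ai-proxy-multi plugin with a fallback strategy, so that when the primary instance is unhealthy or rate limited the gateway retries the lower-priority instance; the fallback_strategy value and the priority fields shown are our assumptions and should be checked against the plugin reference.

# Illustrative failover configuration; verify fallback_strategy and priority in the plugin docs.
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-failover-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-proxy-multi": {
        "fallback_strategy": "instance_health_and_rate_limiting",
        "instances": [
          {
            "name": "openai-primary",
            "provider": "openai",
            "priority": 1,
            "weight": 1,
            "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
            "options": { "model": "gpt-4" }
          },
          {
            "name": "deepseek-fallback",
            "provider": "deepseek",
            "priority": 0,
            "weight": 1,
            "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } },
            "options": { "model": "deepseek-chat" }
          }
        ]
      }
    }
  }'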
Load Balancing Between LLM Instances
The following example demonstrates how you can configure two models for load balancing, forwarding 80% of the traffic to one instance and 20% to the other.
For demonstration and easier differentiation, you will be configuring one OpenAI instance and one DeepSeek instance as the upstream LLM services.
Create a route as follows and update it with your LLM providers, models, API keys, and endpoints where applicable:
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${ADMIN_API_KEY}" \ -d '{ "id": "ai-proxy-multi-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy-multi": { "instances": [ { "name": "openai-instance", "provider": "openai", "weight": 8, "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { "model": "gpt-4" } }, { "name": "deepseek-instance", "provider": "deepseek", "weight": 2, "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } }, "options": { "model": "deepseek-chat" } } ] } } }'
Best Practices for Proxying LLM APIs
Use Token-Aware Rate Limiting
- Avoid flat per-request rate limits.
- Use per-user or per-token limits.
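A token-aware limit can be expressed with the ai-rate-limiting plugin; the sketch below caps consumption on the earlier ai-proxy-route at 4096 tokens per minute, but the attribute names (limit, time_window, limit_strategy) are our best understanding of the plugin and should be checked against the current documentation.

# Illustrative token-based limit; verify attribute names in the plugin docs.
curl "http://127.0.0.1:9180/apisix/admin/routes/ai-proxy-route" -X PATCH \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "plugins": {
      "ai-rate-limiting": {
        "limit": 4096,
        "time_window": 60,
        "limit_strategy": "total_tokens",
        "rejected_code": 429
      }
    }
  }'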
Enable Retry with Backoff
- Use fallback_strategy in ai-proxy-multi plugin.
- Detect status codes (429, 5xx) properly.
Preserve SSE/Streaming
- Avoid plugins that buffer the full response.
- Confirm Transfer-Encoding: chunked is preserved.
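One way to check this in practice is to send a streaming request through the gateway with client-side buffering disabled and inspect the response headers; the payload is illustrative, and it assumes the route forwards the stream flag to the provider.

# -i prints response headers so you can confirm Transfer-Encoding: chunked (or an
# SSE Content-Type) survives the proxy; -N disables client-side buffering.
curl -isN "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{"stream": true, "messages": [{"role": "user", "content": "Stream a short haiku."}]}' \
  | head -n 20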
Differentiate Providers
- Route requests to different upstreams by model name, header, or request pattern.
Use Observability Plugins
- Enable the prometheus plugin for metrics, and a tracing plugin such as zipkin or opentelemetry for distributed tracing (see the sketch below).
- Log response time and token usage for cost control.
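A minimal sketch of enabling metrics: add the prometheus plugin to the existing route, then scrape APISIX's exporter endpoint (by default exposed on port 9091 at /apisix/prometheus/metrics) to see request counts and latency histograms.

# Enable the prometheus plugin on the LLM route created earlier.
curl "http://127.0.0.1:9180/apisix/admin/routes/ai-proxy-route" -X PATCH \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{ "plugins": { "prometheus": {} } }'

# Scrape the metrics endpoint (default exporter address; adjust if customized).
curl -s "http://127.0.0.1:9091/apisix/prometheus/metrics" | grep apisix_http_latency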
Real-World Use Case: LLM Gateway for Multi-Agent System
Scenario:
- AI platform using OpenAI GPT-4, fallback to DeepSeek.
- Custom token quota per user.
- Streaming response to frontend.
Benefits from API Gateway:
- Reduce load on primary upstream.
- Improve user experience by retrying automatically.
- Add analytics based on token consumption.
Conclusion
As LLMs become critical to production applications, API gateways play a pivotal role in managing LLM API traffic reliably, efficiently, and securely. With tools like Apache APISIX and its LLM-compatible plugins, engineers can implement token-based rate limiting, intelligent retries, streaming proxy, and failover across multiple providers such as OpenAI and DeepSeek.
By architecting the API gateway layer with these capabilities, teams can build resilient AI-powered systems with controlled cost, robust error handling, and excellent developer experience.
Next Steps
Stay tuned for our upcoming column on the API gateway Guide, where you'll find the latest updates and insights!
Eager to deepen your knowledge about API gateways? Follow our LinkedIn for valuable insights delivered straight to your inbox!
If you have any questions or need further assistance, feel free to contact API7 Experts.