How API Gateways Proxy LLM Requests: Architecture, Best Practices, and Real-World Examples
API7.ai
June 11, 2025
Introduction
Large Language Models (LLMs), such as OpenAI's GPT-4 and DeepSeek's models, are increasingly used in applications ranging from AI assistants to code generation platforms. However, LLM APIs introduce unique challenges: they are rate-limited, compute-intensive, and occasionally unstable. API gateways can help software teams manage, secure, and optimize LLM requests in production systems.
This article explains how API gateways proxy LLM requests, including architectural design, plugin integration, flow control mechanisms, and real-world use cases. We will also explore how Apache APISIX can be configured to handle LLM workloads effectively.
Core Concepts and Challenges of Proxying LLM APIs
LLM-Specific API Characteristics
- High latency: LLM responses may take seconds to return.
- Token-based billing: Most LLM providers charge based on token usage.
- Rate limits: Strict per-minute or per-second rate limits.
- Transient errors: Occasional failures under heavy load make retry strategies necessary.
- Streaming responses: SSE (Server-Sent Events) or chunked responses.
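To make the streaming behavior concrete, here is a minimal sketch of a streaming chat completion request against an OpenAI-compatible endpoint; the prompt is illustrative, and "stream": true asks the provider for SSE-style chunked output.

# Minimal streaming request against an OpenAI-compatible chat endpoint (illustrative prompt).
# -N disables curl's output buffering so SSE chunks are printed as they arrive.
curl -N "https://api.openai.com/v1/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "stream": true,
    "messages": [{ "role": "user", "content": "Summarize SSE in one sentence." }]
  }'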
Challenges in Proxying LLM Requests
- Managing retries with exponential backoff (a minimal client-side sketch follows this list).
- Enforcing request/response timeout thresholds.
- Handling SSE and preserving response streaming.
- Differentiating between upstream APIs (OpenAI vs. DeepSeek).
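As a minimal client-side illustration of the first challenge above, the following shell sketch retries a request on 429 or 5xx responses, doubling the wait after each attempt; the endpoint and payload are placeholders.

# Retry-with-exponential-backoff sketch (illustrative endpoint and payload).
delay=1
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -w "%{http_code}" \
    "http://127.0.0.1:9080/anything" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}]}')
  # Stop on success or on a non-retryable status; retry only on 429 and 5xx.
  if [ "$status" -lt 429 ]; then
    echo "Done with status $status"
    break
  fi
  echo "Attempt $attempt returned $status; retrying in ${delay}s"
  sleep "$delay"
  delay=$((delay * 2))
done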
Architectural Overview: API Gateway as LLM Proxy
Key Responsibilities of API Gateway for LLM
- Traffic routing based on model or provider.
- Authentication management (e.g., API keys for OpenAI, DeepSeek).
- Rate limiting per client or tenant (see the configuration sketch after this list).
- Caching for non-dynamic prompts.
- Observability: Logging, tracing, metrics.
- Failover and fallback mechanisms.
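As a sketch of per-client identification and rate limiting, the following commands use APISIX's standard key-auth and limit-count plugins to create a consumer and cap it at 100 requests per minute; the consumer name, key, limits, and the route id being patched are illustrative assumptions.

# Create a consumer identified by an API key (illustrative name and key).
curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "username": "tenant_a",
    "plugins": {
      "key-auth": { "key": "tenant-a-secret-key" }
    }
  }'

# Attach key-auth and a per-consumer request limit to an existing LLM route
# (assumes a route with id "llm-route" has already been created).
curl "http://127.0.0.1:9180/apisix/admin/routes/llm-route" -X PATCH \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "plugins": {
      "key-auth": {},
      "limit-count": {
        "count": 100,
        "time_window": 60,
        "key_type": "var",
        "key": "consumer_name",
        "rejected_code": 429
      }
    }
  }'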
Request Lifecycle Through API Gateway
sequenceDiagram
    participant Client
    participant API Gateway
    participant OpenAI
    Client->>API Gateway: POST /v1/chat/completions
    API Gateway->>API Gateway: Validate key, apply rate limiting
    API Gateway->>OpenAI: Forward request with upstream credentials
    OpenAI-->>API Gateway: LLM response (possibly streaming)
    API Gateway-->>Client: Stream response
Using Apache APISIX to Proxy LLM Requests
Apache APISIX provides rich plugin capabilities suitable for LLM workloads.
Key Plugins
- ai-proxy: The ai-proxy plugin simplifies access to LLM and embedding models by transforming plugin configurations into the designated request format. It supports integration with OpenAI, DeepSeek, and other OpenAI-compatible APIs.
- ai-rate-limiting: The ai-rate-limiting plugin enforces token-based rate limiting for requests sent to LLM services. It helps manage API usage by controlling the number of tokens consumed within a specified time frame, ensuring fair resource allocation and preventing excessive load on the service. It is often used together with the ai-proxy-multi plugin.
- ai-request-rewrite: The ai-request-rewrite plugin processes client requests by forwarding them to LLM services for transformation before relaying them to upstream services. This enables LLM-powered modifications such as data redaction, content enrichment, or reformatting (a configuration sketch follows this list). The plugin supports integration with OpenAI, DeepSeek, and other OpenAI-compatible APIs.
- ai-aws-content-moderation: The ai-aws-content-moderation plugin integrates with AWS Comprehend to check request bodies for toxicity, such as profanity, hate speech, insults, harassment, and violence, when proxying to LLMs, rejecting requests whose evaluated score exceeds the configured threshold.
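As a rough sketch of the ai-request-rewrite use case, the configuration below asks an LLM to redact personal data from request bodies before they reach the backend; the attribute names (notably prompt) and the httpbin upstream are our assumptions and should be verified against the current plugin reference.

# Illustrative ai-request-rewrite configuration; verify attribute names in the plugin docs.
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-request-rewrite-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-request-rewrite": {
        "provider": "openai",
        "auth": {
          "header": {
            "Authorization": "Bearer '"$OPENAI_API_KEY"'"
          }
        },
        "options": { "model": "gpt-4" },
        "prompt": "Redact any personal data (names, emails, phone numbers) from the request body and return only the redacted body."
      }
    },
    "upstream": {
      "type": "roundrobin",
      "nodes": { "httpbin.org:80": 1 }
    }
  }'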
Example: Proxy to OpenAI
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${ADMIN_API_KEY}" \ -d '{ "id": "ai-proxy-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy": { "provider": "openai", "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options":{ "model": "gpt-4" } } } }'
Failover Between OpenAI and DeepSeek
sequenceDiagram
    participant Client
    participant API Gateway
    participant OpenAI
    participant DeepSeek
    Client->>API Gateway: POST /v1/chat/completions
    API Gateway->>OpenAI: Primary upstream call
    OpenAI-->>API Gateway: Error (e.g. 429)
    API Gateway->>DeepSeek: Retry as fallback
    DeepSeek-->>API Gateway: Success
    API Gateway-->>Client: Return DeepSeek response
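A configuration sketch for this failover flow uses the ai-proxy-multi plugin with a fallback strategy, so that when the primary instance is unhealthy or rate limited the gateway retries the lower-priority instance; the fallback_strategy value and the priority fields shown are our assumptions and should be checked against the plugin reference.

# Illustrative failover configuration; verify fallback_strategy and priority in the plugin docs.
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "id": "ai-failover-route",
    "uri": "/anything",
    "methods": ["POST"],
    "plugins": {
      "ai-proxy-multi": {
        "fallback_strategy": "instance_health_and_rate_limiting",
        "instances": [
          {
            "name": "openai-primary",
            "provider": "openai",
            "priority": 1,
            "weight": 1,
            "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
            "options": { "model": "gpt-4" }
          },
          {
            "name": "deepseek-fallback",
            "provider": "deepseek",
            "priority": 0,
            "weight": 1,
            "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } },
            "options": { "model": "deepseek-chat" }
          }
        ]
      }
    }
  }'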
Load Balancing Between LLM Instances
The following example demonstrates how you can configure two models for load balancing, forwarding 80% of the traffic to one instance and 20% to the other.
For demonstration and easier differentiation, you will be configuring one OpenAI instance and one DeepSeek instance as the upstream LLM services.
Create a route as follows and update it with your LLM providers, models, API keys, and endpoints where applicable:
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${ADMIN_API_KEY}" \ -d '{ "id": "ai-proxy-multi-route", "uri": "/anything", "methods": ["POST"], "plugins": { "ai-proxy-multi": { "instances": [ { "name": "openai-instance", "provider": "openai", "weight": 8, "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { "model": "gpt-4" } }, { "name": "deepseek-instance", "provider": "deepseek", "weight": 2, "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } }, "options": { "model": "deepseek-chat" } } ] } } }'
Best Practices for Proxying LLM APIs
Use Token-Aware Rate Limiting
- Avoid flat per-request rate limits.
- Use per-user or per-token limits.
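A token-aware limit can be expressed with the ai-rate-limiting plugin; the sketch below caps consumption on the earlier ai-proxy-route at 4096 tokens per minute, but the attribute names (limit, time_window, limit_strategy) are our best understanding of the plugin and should be checked against the current documentation.

# Illustrative token-based limit; verify attribute names in the plugin docs.
curl "http://127.0.0.1:9180/apisix/admin/routes/ai-proxy-route" -X PATCH \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{
    "plugins": {
      "ai-rate-limiting": {
        "limit": 4096,
        "time_window": 60,
        "limit_strategy": "total_tokens",
        "rejected_code": 429
      }
    }
  }'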
Enable Retry with Backoff
- Use fallback_strategy in ai-proxy-multi plugin.
- Detect status codes (429, 5xx) properly.
Preserve SSE/Streaming
- Avoid plugins that buffer the full response.
- Confirm Transfer-Encoding: chunked is preserved.
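One way to check this in practice is to send a streaming request through the gateway with client-side buffering disabled and inspect the response headers; the payload is illustrative, and it assumes the route forwards the stream flag to the provider.

# -i prints response headers so you can confirm Transfer-Encoding: chunked (or an
# SSE Content-Type) survives the proxy; -N disables client-side buffering.
curl -isN "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -d '{"stream": true, "messages": [{"role": "user", "content": "Stream a short haiku."}]}' \
  | head -n 20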
Differentiate Providers
- Route requests to different upstreams by model name, header, or request pattern.
Use Observability Plugins
- Enable the prometheus plugin for metrics, and a tracing plugin such as zipkin or opentelemetry for distributed tracing (see the sketch below).
- Log response time and token usage for cost control.
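A minimal sketch of enabling metrics: add the prometheus plugin to the existing route, then scrape APISIX's exporter endpoint (by default exposed on port 9091 at /apisix/prometheus/metrics) to see request counts and latency histograms.

# Enable the prometheus plugin on the LLM route created earlier.
curl "http://127.0.0.1:9180/apisix/admin/routes/ai-proxy-route" -X PATCH \
  -H "X-API-KEY: ${ADMIN_API_KEY}" \
  -d '{ "plugins": { "prometheus": {} } }'

# Scrape the metrics endpoint (default exporter address; adjust if customized).
curl -s "http://127.0.0.1:9091/apisix/prometheus/metrics" | grep apisix_http_latency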
Real-World Use Case: LLM Gateway for Multi-Agent System
Scenario:
- AI platform using OpenAI GPT-4, fallback to DeepSeek.
- Custom token quota per user.
- Streaming response to frontend.
Benefits from API Gateway:
- Reduce load on primary upstream.
- Improve user experience by retrying automatically.
- Add analytics based on token consumption.
Conclusion
As LLMs become critical to production applications, API gateways play a pivotal role in managing LLM API traffic reliably, efficiently, and securely. With tools like Apache APISIX and its LLM-compatible plugins, engineers can implement token-based rate limiting, intelligent retries, streaming proxy, and failover across multiple providers such as OpenAI and DeepSeek.
By architecting the API gateway layer with these capabilities, teams can build resilient AI-powered systems with controlled cost, robust error handling, and excellent developer experience.
Next Steps
Stay tuned for our upcoming column on the API gateway Guide, where you'll find the latest updates and insights!
Eager to deepen your knowledge about API gateways? Follow our LinkedIn for valuable insights delivered straight to your inbox!
If you have any questions or need further assistance, feel free to contact API7 Experts.