How AI Gateways Cut Costs by 75% with Smart Caching: Lessons from DeepSeek
May 26, 2026
Last week, DeepSeek's announcement about their native coding agent with "high caching and low cost" sparked intense discussion on Hacker News. The post garnered over 700 upvotes and 274 comments, with developers praising the dramatic cost reduction achieved through intelligent caching strategies. This trend highlights a critical challenge facing organizations today: how to control spiraling AI API costs without sacrificing performance.
According to recent data, the average enterprise spends $50,000-$200,000 monthly on LLM API calls. Yet studies show that 40-60% of these requests are repetitive or cacheable. This represents a massive opportunity for cost optimization—if you have the right infrastructure in place.
Enter the AI Gateway: a specialized API Gateway designed to sit between your applications and AI model providers, implementing intelligent caching, rate limiting, and cost controls. In this article, we'll explore how AI Gateways can replicate DeepSeek's cost-saving approach and show you how to implement this architecture using Apache APISIX.
The Core Problem: AI API Costs Are Out of Control
AI model APIs operate on a token-based pricing model. Every request to OpenAI's GPT-4, Anthropic's Claude, or other LLMs incurs costs based on both input tokens (prompt) and output tokens (completion). For context:
- GPT-4 Turbo: $10 per 1M input tokens, $30 per 1M output tokens
- Claude 3 Opus: $15 per 1M input tokens, $75 per 1M output tokens
- DeepSeek V3: $0.27 per 1M input tokens, $1.10 per 1M output tokens
The challenge intensifies in production environments where:
- Repetitive queries consume unnecessary tokens (e.g., "Explain what an API is" asked 1,000 times)
- Identical context is sent repeatedly in multi-turn conversations
- No centralized control exists across microservices making AI calls
- Rate limits from providers cause outages during traffic spikes
- No cost visibility makes it impossible to track spending per team or service
DeepSeek's approach demonstrates that intelligent caching at the gateway layer can eliminate 40-75% of redundant API calls while maintaining response quality.
The AI Gateway Solution: Cache, Control, and Optimize
An AI Gateway acts as an intelligent proxy between your applications and AI providers, implementing:
1. Semantic Caching
Unlike traditional HTTP caching (which requires exact matches), semantic caching understands that "What is an API?" and "Can you explain what an API is?" are functionally identical. Modern AI Gateways use embedding-based similarity matching to cache responses intelligently.
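The similarity-matching idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular gateway implements it: `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts non-stopword tokens), and the `SemanticCache` class, the stopword list, and the 0.8 threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stopword list so the toy "embedding" keeps only topic words.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "can", "you", "explain"}


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag of non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored response when a new prompt is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((toy_embed(prompt), response))

    def get(self, prompt: str):
        query = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("What is an API?", "An API is a contract between programs.")
print(cache.get("Can you explain what an API is?"))  # served from cache
print(cache.get("What is a database?"))              # no similar entry
```

A real deployment would swap `toy_embed` for an embedding model and the linear scan for a vector index. The threshold then becomes the main tuning knob: too low and users get answers to questions they did not ask, too high and the hit rate collapses toward exact matching.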
2. Multi-Level Caching Strategy
- L1 Cache: In-memory cache for hot requests (millisecond response times)
- L2 Cache: Redis/Memcached for distributed caching across instances
- TTL Management: Configurable expiration based on content freshness requirements
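The lookup order across the two levels might look like the following sketch. It assumes a plain dictionary as a stand-in for Redis or Memcached so that it runs without any services; the `TwoTierCache` name, the TTL defaults, and the injectable `clock` are illustrative choices, not part of APISIX.

```python
import time


class TwoTierCache:
    """L1: per-process dict. L2: a second dict standing in for Redis/Memcached."""

    def __init__(self, l1_ttl=60.0, l2_ttl=3600.0, clock=time.monotonic):
        self.l1, self.l2 = {}, {}
        self.l1_ttl, self.l2_ttl = l1_ttl, l2_ttl
        self.clock = clock

    def _lookup(self, store, key):
        """Return (value, expires_at) if present and unexpired, else None."""
        entry = store.get(key)
        if entry is not None and self.clock() >= entry[1]:
            del store[key]  # lazily evict expired entries
            return None
        return entry

    def get(self, key):
        entry = self._lookup(self.l1, key)
        if entry is not None:
            return entry[0], "L1"
        entry = self._lookup(self.l2, key)
        if entry is not None:
            value, l2_expires_at = entry
            # Promote to L1, never outliving the L2 entry it was copied from.
            self.l1[key] = (value, min(self.clock() + self.l1_ttl, l2_expires_at))
            return value, "L2"
        return None, "MISS"

    def put(self, key, value):
        now = self.clock()
        self.l1[key] = (value, now + self.l1_ttl)
        self.l2[key] = (value, now + self.l2_ttl)


cache = TwoTierCache(l1_ttl=60, l2_ttl=3600)
cache.put("prompt-hash", "cached completion")
print(cache.get("prompt-hash"))  # ('cached completion', 'L1')
```

Keeping the L1 TTL much shorter than the L2 TTL bounds how long one gateway instance can serve an entry that has already been invalidated in the shared cache.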
3. Cost Controls
- Rate limiting per user, team, or API key
- Budget caps with automatic throttling
- Request routing to cheaper models for simple queries
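As a rough sketch of the last two controls, the snippet below tracks spend per team against a monthly cap and picks a model from a simple prompt-length heuristic. Everything here is hypothetical: the `BudgetGuard` class, the 80% throttle point, the 30-word cutoff, and the model names used as defaults are assumptions for illustration, and a real router would classify requests with something better than word count.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetGuard:
    """Tracks spend per team and decides whether to allow, throttle, or reject."""

    monthly_cap_usd: float
    throttle_at: float = 0.8  # fraction of the cap at which throttling starts
    spent: dict = field(default_factory=dict)

    def record(self, team: str, cost_usd: float) -> None:
        self.spent[team] = self.spent.get(team, 0.0) + cost_usd

    def decide(self, team: str) -> str:
        used = self.spent.get(team, 0.0) / self.monthly_cap_usd
        if used >= 1.0:
            return "reject"
        if used >= self.throttle_at:
            return "throttle"
        return "allow"


def pick_model(prompt: str, cheap: str = "gpt-3.5-turbo",
               strong: str = "gpt-4-turbo", max_cheap_words: int = 30) -> str:
    """Hypothetical heuristic: short prompts go to the cheaper model."""
    return cheap if len(prompt.split()) <= max_cheap_words else strong


guard = BudgetGuard(monthly_cap_usd=100.0)
guard.record("search-team", 85.0)
print(guard.decide("search-team"))    # throttle
print(pick_model("What is an API?"))  # gpt-3.5-turbo
```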
4. Observability
- Token usage tracking per endpoint
- Cost attribution by service/team
- Performance metrics and cache hit rates
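The bookkeeping behind these metrics is simple enough to sketch. The `UsageTracker` below is a hypothetical in-process aggregator, with prices passed in per million tokens; in a real gateway the same counters would be exported to a metrics system rather than held in a dictionary.

```python
from collections import defaultdict


class UsageTracker:
    """Aggregates cache hits, misses, and estimated spend per team."""

    def __init__(self, input_price_per_m: float, output_price_per_m: float):
        self.input_price = input_price_per_m
        self.output_price = output_price_per_m
        self.stats = defaultdict(lambda: {"hits": 0, "misses": 0, "cost_usd": 0.0})

    def record(self, team: str, cache_hit: bool,
               input_tokens: int = 0, output_tokens: int = 0) -> None:
        entry = self.stats[team]
        if cache_hit:
            entry["hits"] += 1  # served from cache: no provider charge
            return
        entry["misses"] += 1
        entry["cost_usd"] += (
            input_tokens * self.input_price + output_tokens * self.output_price
        ) / 1_000_000

    def hit_rate(self, team: str) -> float:
        entry = self.stats[team]
        total = entry["hits"] + entry["misses"]
        return entry["hits"] / total if total else 0.0


tracker = UsageTracker(input_price_per_m=10.0, output_price_per_m=30.0)
tracker.record("chat-team", cache_hit=False, input_tokens=500, output_tokens=300)
for _ in range(3):
    tracker.record("chat-team", cache_hit=True)
print(f"hit rate: {tracker.hit_rate('chat-team'):.0%}")  # hit rate: 75%
```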
Step-by-Step: Implementing an AI Gateway with Apache APISIX
Let's build a production-ready AI Gateway that implements intelligent caching for OpenAI API calls. This architecture can reduce costs by 40-75% for typical workloads.
Architecture Overview
graph LR
A[Web App] --> B[AI Gateway<br/>Apache APISIX]
C[Mobile App] --> B
D[Backend Service] --> B
B --> E{Cache Hit?}
E -->|Yes| F[Return Cached<br/>Response]
E -->|No| G[OpenAI API]
G --> H[Redis Cache]
H --> F
B --> I[Prometheus<br/>Metrics]
style B fill:#1e90ff,stroke:#333,stroke-width:3px
style H fill:#dc143c,stroke:#333,stroke-width:2px
style G fill:#10b981,stroke:#333,stroke-width:2px
Prerequisites
# Install Apache APISIX
curl https://raw.githubusercontent.com/apache/apisix/master/utils/install-apisix.sh -sL | bash

# Install Redis for caching
docker run -d --name redis -p 6379:6379 redis:7-alpine

# Start APISIX
apisix start
Step 1: Configure Redis Cache for APISIX
First, enable Redis as the caching backend in APISIX configuration:
# conf/config.yaml
apisix:
  cache:
    type: redis

deployment:
  role: traditional
  role_traditional:
    config_provider: etcd
  admin:
    admin_key:
      - name: admin
        key: your-admin-key
        role: admin

plugin_attr:
  redis:
    host: 127.0.0.1
    port: 6379
    database: 0
Step 2: Create an AI Gateway Route with Caching
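Now let's create a route that proxies OpenAI API requests with caching. The route uses the proxy-cache plugin with its in-memory strategy and keys the cache on the request body, so only byte-identical requests produce a hit: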
curl -i "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: your-admin-key" \
  -H "Content-Type: application/json" -d '
{
  "id": "openai-cached",
  "uri": "/v1/chat/completions",
  "methods": ["POST"],
  "upstream": {
    "type": "roundrobin",
    "scheme": "https",
    "nodes": {
      "api.openai.com:443": 1
    },
    "pass_host": "node"
  },
  "plugins": {
    "proxy-rewrite": {
      "headers": {
        "Authorization": "Bearer $http_x_api_key",
        "Content-Type": "application/json"
      }
    },
    "proxy-cache": {
      "cache_strategy": "memory",
      "cache_ttl": 3600,
      "cache_key": ["$request_body"],
      "cache_bypass": ["$arg_nocache"],
      "cache_method": ["POST"],
      "cache_http_status": [200],
      "hide_cache_headers": false
    },
    "limit-req": {
      "rate": 100,
      "burst": 50,
      "rejected_code": 429,
      "key": "remote_addr"
    },
    "prometheus": {
      "prefer_name": true
    }
  }
}'
Step 3: Implement Semantic Caching with Custom Plugin
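For more control over the cache key, and as a starting point for semantic caching (matching similar queries), add custom Lua through the serverless-pre-function and serverless-post-function plugins. The example keys the cache on a hash of the prompt, so it still matches only identical prompts; true semantic matching would replace the hash with an embedding lookup. This route matches the same URI as the Step 2 route, so remove that route first to avoid ambiguous matching: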
curl -i "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: your-admin-key" \
  -H "Content-Type: application/json" -d '
{
  "id": "openai-semantic-cache",
  "uri": "/v1/chat/completions",
  "methods": ["POST"],
  "upstream": {
    "type": "roundrobin",
    "scheme": "https",
    "nodes": { "api.openai.com:443": 1 },
    "pass_host": "node"
  },
  "plugins": {
    "proxy-rewrite": {
      "headers": { "Authorization": "Bearer $http_x_api_key" }
    },
    "serverless-pre-function": {
      "phase": "access",
      "functions": [
        "return function(conf, ctx)
          local core = require(\"apisix.core\")
          local redis = require(\"resty.redis\")
          local cjson = require(\"cjson\")

          -- Get request body
          local body = core.request.get_body()
          if not body then
            return core.response.exit(400, { error = \"Request body required\" })
          end

          -- Parse JSON with error handling
          local ok_json, data = pcall(cjson.decode, body)
          if not ok_json or type(data) ~= \"table\" then
            return core.response.exit(400, { error = \"Invalid JSON\" })
          end

          -- Streamed responses cannot be replayed as one JSON body: skip caching
          if data.stream then
            return
          end

          -- Build the key from the model plus each message role and content,
          -- so different models or conversations never share an entry
          local parts = { tostring(data.model) }
          for _, msg in ipairs(data.messages or {}) do
            if type(msg.content) ~= \"string\" then
              return -- non-text content: skip caching
            end
            parts[#parts + 1] = tostring(msg.role) .. \":\" .. msg.content
          end

          -- Simple hash-based cache key (production: use embeddings)
          local cache_key = \"ai_cache:\" .. ngx.md5(table.concat(parts, \"|\"))

          -- Connect to Redis; on failure fall through to the upstream
          local red = redis:new()
          red:set_timeout(1000)
          local ok, err = red:connect(\"127.0.0.1\", 6379)
          if not ok then
            core.log.warn(\"redis connect failed: \", err)
            return
          end

          -- Check cache, then return the connection to the pool
          local cached_response = red:get(cache_key)
          red:set_keepalive(10000, 100)
          if cached_response and cached_response ~= ngx.null then
            ngx.header[\"X-Cache-Status\"] = \"HIT\"
            ngx.header[\"Content-Type\"] = \"application/json\"
            ngx.status = 200
            ngx.say(cached_response)
            return ngx.exit(200)
          end

          -- Cache miss: remember the key for the response phase
          ctx.cache_key = cache_key
          ngx.header[\"X-Cache-Status\"] = \"MISS\"
        end"
      ]
    },
    "serverless-post-function": {
      "phase": "body_filter",
      "functions": [
        "return function(conf, ctx)
          if not ctx.cache_key or ngx.status ~= 200 then
            return
          end
          local core = require(\"apisix.core\")

          -- Buffer chunks; the full body is returned with the last chunk
          local response_body = core.response.hold_body_chunk(ctx)
          if not response_body then
            return
          end
          ngx.arg[1] = response_body

          -- Cosockets are unavailable in body_filter, so write to Redis
          -- from a zero-delay timer instead
          ngx.timer.at(0, function(premature, key, value)
            if premature then
              return
            end
            local red = require(\"resty.redis\"):new()
            red:set_timeout(1000)
            if red:connect(\"127.0.0.1\", 6379) then
              red:setex(key, 3600, value)
              red:set_keepalive(10000, 100)
            end
          end, ctx.cache_key, response_body)
        end"
      ]
    },
    "limit-req": {
      "rate": 100,
      "burst": 50,
      "key": "remote_addr"
    }
  }
}'
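Whatever computes the cache key, it generally needs to cover more than the concatenated message text: two requests for different models, or two conversations whose text happens to join to the same string, should not share an entry. The following Python sketch shows one way to derive such a key outside the gateway, for example when unit-testing key behavior. The `cache_key` function and the choice to key on `model` and `messages` only (ignoring sampling parameters such as temperature) are assumptions for illustration.

```python
import hashlib
import json


def cache_key(payload: dict) -> str:
    """Derive a deterministic key from the model and the ordered messages."""
    material = {
        "model": payload.get("model"),
        "messages": [
            {"role": m.get("role"), "content": m.get("content")}
            for m in payload.get("messages", [])
        ],
    }
    # sort_keys makes the serialization independent of dict insertion order
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return "ai_cache:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


request = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "What is an API Gateway?"}],
}
print(cache_key(request))
```

Serializing the structure rather than concatenating strings keeps message boundaries in the key, so `["ab", "c"]` and `["a", "bc"]` hash differently.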
Step 4: Test Your AI Gateway
Now test the caching behavior with a sample request:
# First request (cache miss)
curl -i http://127.0.0.1:9080/v1/chat/completions \
  -H "X-API-Key: your-openai-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "What is an API Gateway?"}
    ]
  }'
# Response headers show: X-Cache-Status: MISS
# Response time: ~2000ms

# Second identical request (cache hit)
curl -i http://127.0.0.1:9080/v1/chat/completions \
  -H "X-API-Key: your-openai-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "What is an API Gateway?"}
    ]
  }'
# Response headers show: X-Cache-Status: HIT
# Response time: ~15ms
# Cost: $0 (no OpenAI API call made)
Step 5: Monitor Cache Performance and Cost Savings
Add Prometheus metrics to track your cost savings:
# Check cache hit rate
curl http://127.0.0.1:9091/apisix/prometheus/metrics | grep cache

# Illustrative output, assuming a request counter labeled by cache status is exported:
# apisix_http_requests_total{route="openai-cached",cache_status="HIT"} 450
# apisix_http_requests_total{route="openai-cached",cache_status="MISS"} 150
# Cache hit rate: 75% (450/600)
# Cost savings: 75% reduction in API calls
Real-World Results: Cost Savings Calculator
Let's calculate savings for a typical production workload:
Scenario: SaaS application with AI-powered chat assistant
- Monthly requests: 1 million
- Average tokens per request: 500 input + 300 output
- Model: GPT-4 Turbo
- Cache hit rate: 60% (conservative estimate)
Without AI Gateway:
- Cost per request: (500 × $10/1M) + (300 × $30/1M) = $0.014
- Monthly cost: 1M × $0.014 = $14,000
With AI Gateway:
- Cached requests (60%): $0 cost
- Uncached requests (40%): 400K × $0.014 = $5,600
- Monthly cost: $5,600
- Monthly savings: $8,400 (60%)
With higher cache hit rates (like DeepSeek's 75%), savings increase to $10,500/month.
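The arithmetic in this scenario is easy to rerun with your own numbers. The sketch below reproduces it; the `monthly_cost` helper is hypothetical, and like the scenario it assumes a cached response costs nothing, which ignores the cost of running the cache itself.

```python
def monthly_cost(requests: int, input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float,
                 cache_hit_rate: float = 0.0) -> float:
    """Estimated monthly provider spend in USD; cache hits are treated as free."""
    per_request = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    billable_requests = requests * (1.0 - cache_hit_rate)
    return billable_requests * per_request


# Scenario from this section: 1M requests, 500 input + 300 output tokens,
# GPT-4 Turbo at $10 / $30 per 1M tokens.
baseline = monthly_cost(1_000_000, 500, 300, 10.0, 30.0)
for hit_rate in (0.0, 0.60, 0.75):
    cost = monthly_cost(1_000_000, 500, 300, 10.0, 30.0, cache_hit_rate=hit_rate)
    print(f"hit rate {hit_rate:.0%}: cost ${cost:,.0f}, savings ${baseline - cost:,.0f}")
# hit rate 0%: cost $14,000, savings $0
# hit rate 60%: cost $5,600, savings $8,400
# hit rate 75%: cost $3,500, savings $10,500
```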
Best Practices for AI Gateway Caching
- Use TTLs strategically: Set a shorter TTL (1 hour) for dynamic content and a longer TTL (24 hours) for factual queries
- Implement cache warming: Pre-populate cache with common queries during off-peak hours
- Monitor token usage: Track cache hit rates per endpoint to optimize caching strategies
- Use request classification: Route simple queries to cheaper models (GPT-3.5) and complex queries to GPT-4
- Implement fallback logic: If cache layer fails, gracefully fall back to direct API calls
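The fallback practice in the last bullet amounts to treating every cache error as a miss. A minimal sketch, assuming the cache and the provider are passed in as plain callables (`cache_get`, `cache_put`, and `call_provider` are hypothetical names, not part of any library):

```python
def cached_completion(prompt, cache_get, cache_put, call_provider):
    """Serve from cache when possible; never let a cache failure block a request."""
    try:
        cached = cache_get(prompt)
    except Exception:
        cached = None  # fail open: a broken cache behaves like a miss
    if cached is not None:
        return cached, "HIT"

    response = call_provider(prompt)
    try:
        cache_put(prompt, response)
    except Exception:
        pass  # the response is still returned even if it could not be stored
    return response, "MISS"


store = {}
provider = lambda prompt: f"answer to: {prompt}"
print(cached_completion("What is an API?", store.get, store.__setitem__, provider))
print(cached_completion("What is an API?", store.get, store.__setitem__, provider))
```

In production the `except` branches would also log the error and increment a metric, so that a silently failing cache shows up as a drop in hit rate rather than going unnoticed.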
Conclusion
The Hacker News discussion around DeepSeek's cost optimization through caching reflects a broader trend: organizations are demanding smarter, more cost-effective AI infrastructure. An AI Gateway isn't just about saving money—it's about building sustainable, scalable AI applications.
By implementing intelligent caching with Apache APISIX, you can:
- Reduce AI API costs by 40-75% through cache optimization
- Improve response times from seconds to milliseconds for cached requests
- Prevent outages with rate limiting and circuit breakers
- Gain visibility into AI spending across teams and services
The architecture we've built today is production-ready and can scale to handle millions of requests per day. As AI becomes increasingly central to modern applications, having a robust AI Gateway strategy isn't optional—it's essential.
Try API7 Enterprise for Free
API7 Enterprise extends the core open-source functionality of Apache APISIX to provide customized, full-lifecycle API management for enterprises, including advanced AI Gateway capabilities with:
- Semantic caching using vector embeddings
- Multi-provider routing (OpenAI, Anthropic, Azure OpenAI)
- Cost analytics dashboard with per-team attribution
- Enterprise SLA and 24/7 support
