How AI Gateways Cut Costs by 75% with Smart Caching: Lessons from DeepSeek

Last week, DeepSeek's announcement about their native coding agent with "high caching and low cost" sparked intense discussion on Hacker News. The post garnered over 700 upvotes and 274 comments, with developers praising the dramatic cost reduction achieved through intelligent caching strategies. This trend highlights a critical challenge facing organizations today: how to control spiraling AI API costs without sacrificing performance.

According to recent data, the average enterprise spends $50,000-$200,000 monthly on LLM API calls. Yet studies show that 40-60% of these requests are repetitive or cacheable. This represents a massive opportunity for cost optimization—if you have the right infrastructure in place.

Enter the AI Gateway: a specialized API Gateway designed to sit between your applications and AI model providers, implementing intelligent caching, rate limiting, and cost controls. In this article, we'll explore how AI Gateways can replicate DeepSeek's cost-saving approach and show you how to implement this architecture using Apache APISIX.

The Core Problem: AI API Costs Are Out of Control

AI model APIs operate on a token-based pricing model. Every request to OpenAI's GPT-4, Anthropic's Claude, or other LLMs incurs costs based on both input tokens (prompt) and output tokens (completion). For context:

GPT-4 Turbo: $10 per 1M input tokens, $30 per 1M output tokens
Claude 3 Opus: $15 per 1M input tokens, $75 per 1M output tokens
DeepSeek V3: $0.27 per 1M input tokens, $1.10 per 1M output tokens

The challenge intensifies in production environments where:

Repetitive queries consume unnecessary tokens (e.g., "Explain what an API is" asked 1,000 times)
Identical context is sent repeatedly in multi-turn conversations
No centralized control exists across microservices making AI calls
Rate limits from providers cause outages during traffic spikes
No cost visibility makes it impossible to track spending per team or service

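DeepSeek's approach suggests how much repeated work a cache can absorb. Applying the same idea at the gateway layer can remove a large share of API calls (40-75% in the scenarios discussed here), and a cached response is byte-for-byte the one the model originally returned.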

The AI Gateway Solution: Cache, Control, and Optimize

An AI Gateway acts as an intelligent proxy between your applications and AI providers, implementing:

1. Semantic Caching

Unlike traditional HTTP caching (which requires exact matches), semantic caching understands that "What is an API?" and "Can you explain what an API is?" are functionally identical. Modern AI Gateways use embedding-based similarity matching to cache responses intelligently.
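To make the matching step concrete, here is a minimal Python sketch of similarity-based lookup. It is an illustration under stated assumptions, not how any particular gateway implements it: `embed` is a toy bag-of-words stand-in for a real embedding model, and `SemanticCache` and the 0.7 threshold are hypothetical names and values chosen for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a stored response when a new prompt is similar enough."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold  # assumed value; tune per workload
        self.entries = []           # list of (vector, response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def get(self, prompt: str):
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Only serve from cache when the best match clears the threshold
        return best_response if best_score >= self.threshold else None
```

With this toy embedding, storing a response for "What is an API?" lets "Can you explain what an API is?" hit the cache, while an unrelated prompt misses. A real deployment would swap in a proper embedding model and a vector index, and would tune the threshold to avoid serving answers to questions that only look similar.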

2. Multi-Level Caching Strategy

L1 Cache: In-memory cache for hot requests (millisecond response times)
L2 Cache: Redis/Memcached for distributed caching across instances
TTL Management: Configurable expiration based on content freshness requirements
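The lookup order across the two tiers can be sketched in Python as follows. This is a simplified model rather than gateway code: `TTLStore` is a hypothetical in-process dictionary standing in for both the memory tier and Redis, and the 60-second and 3600-second TTLs are assumed values.

```python
import time


class TTLStore:
    """Dictionary with per-entry expiry; stands in for either cache tier."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


class TwoLevelCache:
    """Checks the fast L1 tier first, then L2, promoting L2 hits into L1."""

    def __init__(self, l1, l2, l1_ttl=60, l2_ttl=3600):
        self.l1, self.l2 = l1, l2
        self.l1_ttl, self.l2_ttl = l1_ttl, l2_ttl

    def get(self, key):
        value = self.l1.get(key)
        if value is not None:
            return value, "L1"
        value = self.l2.get(key)
        if value is not None:
            self.l1.set(key, value, self.l1_ttl)  # promote hot entry
            return value, "L2"
        return None, "MISS"

    def set(self, key, value):
        self.l1.set(key, value, self.l1_ttl)
        self.l2.set(key, value, self.l2_ttl)
```

Keeping the L1 TTL shorter than the L2 TTL bounds how long a single gateway instance can serve an entry that has already been refreshed or invalidated in the shared tier.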

3. Cost Controls

Rate limiting per user, team, or API key
Budget caps with automatic throttling
Request routing to cheaper models for simple queries
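As a rough Python sketch of the last two controls: `BudgetGuard` and `pick_model` are hypothetical helpers, the 80% throttle point is an assumed policy, and the word-count heuristic is deliberately crude (a real classifier would look at intent, not length).

```python
from dataclasses import dataclass, field


@dataclass
class BudgetGuard:
    """Tracks spend per team and decides whether to allow, throttle, or reject."""

    monthly_cap_usd: float
    throttle_at: float = 0.8  # assumed: start throttling at 80% of the cap
    spent: dict = field(default_factory=dict)

    def record(self, team: str, cost_usd: float) -> None:
        self.spent[team] = self.spent.get(team, 0.0) + cost_usd

    def decision(self, team: str) -> str:
        used = self.spent.get(team, 0.0) / self.monthly_cap_usd
        if used >= 1.0:
            return "reject"
        if used >= self.throttle_at:
            return "throttle"
        return "allow"


def pick_model(prompt: str, cheap: str = "gpt-3.5-turbo",
               strong: str = "gpt-4", max_simple_words: int = 30) -> str:
    """Crude routing heuristic: short prompts go to the cheaper model."""
    return cheap if len(prompt.split()) <= max_simple_words else strong
```

The gateway would call `decision` before forwarding a request and `record` after the provider reports token usage, so the cap reflects actual spend rather than request counts.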

4. Observability

Token usage tracking per endpoint
Cost attribution by service/team
Performance metrics and cache hit rates
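A minimal sketch of what such tracking records, in Python. `UsageTracker` is a hypothetical in-memory aggregator; the GPT-4 Turbo prices are the ones quoted earlier in this article, and a real gateway would export these counters to a metrics system instead of holding them in process memory.

```python
from collections import defaultdict

# (input, output) USD per 1M tokens, taken from the price list in this article
PRICES_PER_MILLION = {"gpt-4-turbo": (10.00, 30.00)}


class UsageTracker:
    """Aggregates token usage per endpoint, cost per team, and cache hit rate."""

    def __init__(self):
        self.cost_by_team = defaultdict(float)
        self.tokens_by_endpoint = defaultdict(int)
        self.hits = 0
        self.total = 0

    def record(self, team, endpoint, model, input_tokens, output_tokens, cache_hit):
        self.total += 1
        if cache_hit:
            self.hits += 1  # served from cache: no tokens billed
            return
        in_price, out_price = PRICES_PER_MILLION[model]
        self.tokens_by_endpoint[endpoint] += input_tokens + output_tokens
        self.cost_by_team[team] += (
            input_tokens * in_price + output_tokens * out_price
        ) / 1_000_000

    def hit_rate(self) -> float:
        return self.hits / self.total if self.total else 0.0
```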

Step-by-Step: Implementing an AI Gateway with Apache APISIX

Let's build an AI Gateway that caches OpenAI API calls. How much it saves depends on how repetitive your traffic is; for workloads with many repeated prompts, this architecture can reduce costs by 40-75%.

Architecture Overview

graph LR
    A[Web App] --> B[AI Gateway<br/>Apache APISIX]
    C[Mobile App] --> B
    D[Backend Service] --> B
    B --> E{Cache Hit?}
    E -->|Yes| F[Return Cached<br/>Response]
    E -->|No| G[OpenAI API]
    G --> H[Redis Cache]
    H --> F
    B --> I[Prometheus<br/>Metrics]

    style B fill:#1e90ff,stroke:#333,stroke-width:3px
    style H fill:#dc143c,stroke:#333,stroke-width:2px
    style G fill:#10b981,stroke:#333,stroke-width:2px

Prerequisites

# Install Apache APISIX
curl https://raw.githubusercontent.com/apache/apisix/master/utils/install-apisix.sh -sL | bash

# Install Redis for caching
docker run -d --name redis -p 6379:6379 redis:7-alpine

# Start APISIX
apisix start

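Step 1: Configure Cache Storage for APISIX

The proxy-cache plugin stores responses in cache zones declared under apisix.proxy_cache in the APISIX configuration, and it supports memory and disk zones. Redis is not configured in this file: the Lua function in Step 3 connects to it directly at 127.0.0.1:6379.

# conf/config.yaml
apisix:
  proxy_cache:
    cache_ttl: 10s                # default disk cache time when the upstream does not specify one
    zones:
      - name: disk_cache_one      # disk-backed zone (sizes here are illustrative)
        memory_size: 50m
        disk_size: 1G
        disk_path: /tmp/disk_cache_one
        cache_levels: "1:2"
      - name: memory_cache        # in-memory zone, referenced by routes as "cache_zone"
        memory_size: 50m

deployment:
  role: traditional
  role_traditional:
    config_provider: etcd
  admin:
    admin_key:
      - name: admin
        key: your-admin-key
        role: admin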

Step 2: Create an AI Gateway Route with Caching

Now let's create a route that proxies OpenAI API requests with intelligent caching:

curl -i "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: your-admin-key" \
  -H "Content-Type: application/json" -d '
{
  "id": "openai-cached",
  "uri": "/v1/chat/completions",
  "methods": ["POST"],
  "upstream": {
    "type": "roundrobin",
    "scheme": "https",
    "nodes": {
      "api.openai.com:443": 1
    },
    "pass_host": "node"
  },
  "plugins": {
    "proxy-rewrite": {
      "headers": {
        "Authorization": "Bearer $http_x_api_key",
        "Content-Type": "application/json"
      }
    },
    "proxy-cache": {
      "cache_strategy": "memory",
      "cache_zone": "memory_cache",
      "cache_ttl": 3600,
      "cache_key": ["$request_body"],
      "cache_bypass": ["$arg_nocache"],
      "cache_method": ["POST"],
      "cache_http_status": [200],
      "hide_cache_headers": false
    },
    "limit-req": {
      "rate": 100,
      "burst": 50,
      "rejected_code": 429,
      "key": "remote_addr"
    },
    "prometheus": {
      "prefer_name": true
    }
  }
}'
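Because this route keys the cache on the raw request body, two requests only share an entry when their bodies are byte-identical; a difference in whitespace or JSON key order produces a miss. One way to raise the hit rate is to derive the key from a canonical form of the payload. The Python sketch below illustrates the idea outside the gateway; `canonical_cache_key` is a hypothetical helper, not an APISIX feature.

```python
import hashlib
import json


def canonical_cache_key(body: bytes) -> str:
    """Hash a JSON body after normalizing key order and whitespace."""
    payload = json.loads(body)
    canonical = json.dumps(
        payload,
        sort_keys=True,          # key order no longer matters
        separators=(",", ":"),   # whitespace no longer matters
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The key still covers every field, so requests that differ in model, temperature, or message text keep separate cache entries.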

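Step 3: Implement Semantic Caching with a Custom Lua Function

To match requests beyond byte-identical bodies, attach custom Lua through the serverless plugins. The functions below are the scaffold for semantic caching: they key the cache on a hash of the model and message contents, which still requires identical wording, and a production version would replace the hash with an embedding similarity lookup. Two things to note before applying this route. It matches the same URI as the Step 2 route, so delete that route first or use a different URI. Cached responses are shared by every caller and served without contacting OpenAI, so validate API keys at the gateway before exposing this to untrusted clients.

curl -i "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
  -H "X-API-KEY: your-admin-key" \
  -H "Content-Type: application/json" -d '
{
  "id": "openai-semantic-cache",
  "uri": "/v1/chat/completions",
  "methods": ["POST"],
  "upstream": {
    "type": "roundrobin",
    "scheme": "https",
    "nodes": {
      "api.openai.com:443": 1
    },
    "pass_host": "node"
  },
  "plugins": {
    "proxy-rewrite": {
      "headers": {
        "Authorization": "Bearer $http_x_api_key",
        "Content-Type": "application/json"
      }
    },
    "serverless-pre-function": {
      "phase": "access",
      "functions": [
        "return function(conf, ctx)
          local core = require(\"apisix.core\")
          local redis = require(\"resty.redis\")
          local cjson = require(\"cjson.safe\")

          -- Get request body
          local body = core.request.get_body()
          if not body then
            return core.response.exit(400, { error = \"Request body required\" })
          end

          -- Parse JSON; cjson.safe returns nil instead of raising on bad input
          local data = cjson.decode(body)
          if type(data) ~= \"table\" then
            return core.response.exit(400, { error = \"Invalid JSON\" })
          end

          -- Streaming and malformed requests bypass the cache in this sketch
          if data.stream == true or type(data.messages) ~= \"table\" then
            return
          end

          -- Build the key from the model plus each message role and content,
          -- so different models or roles never share an entry
          local parts = { tostring(data.model) }
          for _, msg in ipairs(data.messages) do
            if type(msg) ~= \"table\" or type(msg.content) ~= \"string\" then
              return  -- skip caching for non-text content
            end
            parts[#parts + 1] = tostring(msg.role) .. \":\" .. msg.content
          end

          -- Exact-match hash key (production: use embeddings)
          local cache_key = \"ai_cache:\" .. ngx.md5(table.concat(parts, \"|\"))

          -- Connect to Redis; on failure fall through to the upstream
          local red = redis:new()
          red:set_timeout(1000)
          local ok, err = red:connect(\"127.0.0.1\", 6379)
          if not ok then
            core.log.warn(\"redis connect failed: \", err)
            return
          end

          -- Check cache, then return the connection to the pool
          local cached_response = red:get(cache_key)
          red:set_keepalive(10000, 100)
          if cached_response and cached_response ~= ngx.null then
            ngx.header[\"X-Cache-Status\"] = \"HIT\"
            ngx.header[\"Content-Type\"] = \"application/json\"
            ngx.status = 200
            ngx.print(cached_response)
            return ngx.exit(200)
          end

          -- Cache miss: set the header now (body_filter runs after headers
          -- are sent) and store the key for post-processing
          ngx.header[\"X-Cache-Status\"] = \"MISS\"
          ctx.cache_key = cache_key
        end"
      ]
    },
    "serverless-post-function": {
      "phase": "body_filter",
      "functions": [
        "return function(conf, ctx)
          local core = require(\"apisix.core\")

          if not ctx.cache_key or ngx.status ~= 200 then
            return
          end

          -- Keep a copy of each chunk; returns the full body on the last chunk
          local response_body = core.response.hold_body_chunk(ctx, true)
          if not response_body then
            return
          end

          -- Cosockets are disabled in body_filter, so write to Redis from a timer
          local cache_key = ctx.cache_key
          ngx.timer.at(0, function(premature)
            if premature then
              return
            end
            local red = require(\"resty.redis\"):new()
            red:set_timeout(1000)
            local ok = red:connect(\"127.0.0.1\", 6379)
            if ok then
              red:setex(cache_key, 3600, response_body)
              red:set_keepalive(10000, 100)
            end
          end)
        end"
      ]
    },
    "limit-req": {
      "rate": 100,
      "burst": 50,
      "rejected_code": 429,
      "key": "remote_addr"
    }
  }
}'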

Step 4: Test Your AI Gateway

Now test the caching behavior with a sample request:

# First request (cache miss)
curl -i http://127.0.0.1:9080/v1/chat/completions \
  -H "X-API-Key: your-openai-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "What is an API Gateway?"}
    ]
  }'

# Response headers show: X-Cache-Status: MISS
# Response time: ~2000ms

# Second identical request (cache hit)
curl -i http://127.0.0.1:9080/v1/chat/completions \
  -H "X-API-Key: your-openai-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "What is an API Gateway?"}
    ]
  }'

# Response headers show: X-Cache-Status: HIT
# Response time: ~15ms
# Cost: $0 (no OpenAI API call made)

Step 5: Monitor Cache Performance and Cost Savings

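The prometheus plugin enabled in Step 2 exports request counters per route:

# List per-route request counters
curl -s http://127.0.0.1:9091/apisix/prometheus/metrics | grep apisix_http_status

# The built-in metrics are labeled by status code and route, not by cache status.
# To measure the hit rate, count the cache status response header
# (Apisix-Cache-Status for proxy-cache, X-Cache-Status for the Step 3 function)
# in your access logs or in a custom metric.
#
# Illustrative arithmetic: 450 hits out of 600 requests
# Cache hit rate: 75% (450/600)
# Cost savings: 75% reduction in API calls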

Real-World Results: Cost Savings Calculator

Let's calculate savings for a typical production workload:

Scenario: SaaS application with AI-powered chat assistant

Monthly requests: 1 million
Average tokens per request: 500 input + 300 output
Model: GPT-4 Turbo
Cache hit rate: 60% (conservative estimate)

Without AI Gateway:

Cost per request: (500 × $10/1M) + (300 × $30/1M) = $0.014
Monthly cost: 1M × $0.014 = $14,000

With AI Gateway:

Cached requests (60%): $0 cost
Uncached requests (40%): 400K × $0.014 = $5,600
Monthly cost: $5,600
Monthly savings: $8,400 (60%)

With higher cache hit rates (like DeepSeek's 75%), savings increase to $10,500/month.
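The arithmetic above can be reproduced with a few lines of Python, which makes it easy to plug in your own traffic numbers. `monthly_cost` is a hypothetical helper; it assumes cached requests cost nothing and ignores the cost of running the cache itself.

```python
def monthly_cost(requests: int, input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float, hit_rate: float = 0.0) -> float:
    """Monthly spend in USD; prices are USD per 1M tokens."""
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    # Only cache misses reach the provider and are billed
    return requests * (1.0 - hit_rate) * per_request


# Scenario from this section: 1M requests, 500 input + 300 output tokens, GPT-4 Turbo
baseline = monthly_cost(1_000_000, 500, 300, 10.0, 30.0)
with_cache = monthly_cost(1_000_000, 500, 300, 10.0, 30.0, hit_rate=0.60)
savings = baseline - with_cache
```

Changing `hit_rate` to 0.75 reproduces the higher savings figure quoted above.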

Best Practices for AI Gateway Caching

Use TTL strategically: Set shorter TTL (1 hour) for dynamic content, longer TTL (24 hours) for factual queries
Implement cache warming: Pre-populate cache with common queries during off-peak hours
Monitor token usage: Track cache hit rates per endpoint to optimize caching strategies
Use request classification: Route simple queries to cheaper models (GPT-3.5) and complex queries to GPT-4
Implement fallback logic: If cache layer fails, gracefully fall back to direct API calls
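Three of these practices (TTL selection, cache warming, and fallback) fit in one short Python sketch. All names here are hypothetical: `cache` is any object with `get(key)` and `set(key, value, ttl)`, `call_model` is whatever function reaches the provider, and the two TTL classes mirror the 1-hour and 24-hour values suggested above.

```python
def ttl_for(kind: str) -> int:
    """TTL in seconds: 1 hour for dynamic content, 24 hours for factual queries."""
    return {"dynamic": 3600, "factual": 86400}.get(kind, 3600)


def answer(prompt, cache, call_model, kind="dynamic"):
    """Read-through lookup that degrades to a direct call if the cache fails."""
    try:
        cached = cache.get(prompt)
    except Exception:
        # Cache layer is down: fall back to calling the provider directly
        return call_model(prompt), "BYPASS"
    if cached is not None:
        return cached, "HIT"
    response = call_model(prompt)
    try:
        cache.set(prompt, response, ttl_for(kind))
    except Exception:
        pass  # a failed write must not fail the request
    return response, "MISS"


def warm(cache, prompts, call_model, kind="factual"):
    """Pre-populate the cache; returns how many prompts were newly cached."""
    return sum(1 for p in prompts if answer(p, cache, call_model, kind)[1] == "MISS")
```

Catching every exception keeps the sketch short; production code would catch the cache client's specific connection and timeout errors and log them.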

Conclusion

The Hacker News discussion around DeepSeek's cost optimization through caching reflects a broader trend: organizations are demanding smarter, more cost-effective AI infrastructure. An AI Gateway isn't just about saving money—it's about building sustainable, scalable AI applications.

By implementing intelligent caching with Apache APISIX, you can:

Reduce AI API costs by 40-75% through cache optimization
Improve response times from seconds to milliseconds for cached requests
Prevent outages with rate limiting and circuit breakers
Gain visibility into AI spending across teams and services

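The architecture we've built today is a working foundation that can scale to handle millions of requests per day once the caching layer is hardened for production, for example with embedding-based matching in place of the hash key. As AI becomes increasingly central to modern applications, having a robust AI Gateway strategy isn't optional; it's essential.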
