How to Fight Against AI Scrapers with an API Gateway
January 14, 2026
Last month, MetaBrainz—the nonprofit behind MusicBrainz and ListenBrainz—made a painful announcement: they were locking down their public APIs. The reason? AI scrapers had been hammering their servers so aggressively that legitimate users couldn't access the service. Rather than downloading the freely available data dumps, these bots were crawling the entire database one page at a time, a process that would take "hundreds of years to complete" and was "utterly pointless."
MetaBrainz's story is far from unique. Across the web, independent developers and small businesses are reporting eerily similar battles: a personal website maintained for over two decades, shut down after being overwhelmed; a passion project forced behind a paywall when its modest budget could no longer sustain the robotic traffic; companies watching their operational costs skyrocket while real users face frustrating timeouts and errors.
The common thread? Traditional defenses don't work anymore. Robots.txt is ignored. User-Agent filtering is trivially bypassed. IP blocking is a game of whack-a-mole against cloud providers with infinite IP ranges.
This post explores why AI scrapers are different from traditional bots, what actually works to stop them, and how to implement a multi-layered defense using Apache APISIX as your API Gateway.
Why AI Scrapers Are Different
Traditional web scrapers were typically operated by individuals or small companies with limited resources. They respected robots.txt because getting blocked meant starting over. They had identifiable patterns because they were optimized for specific targets.
AI scrapers operate under a fundamentally different model:
| Characteristic | Traditional Scrapers | AI Scrapers |
|---|---|---|
| Scale | Targeted, specific sites | Everything, everywhere |
| Resources | Limited infrastructure | Backed by billion-dollar companies |
| Robots.txt | Generally respected | Frequently ignored |
| Goal | Extract specific data | Consume all available content |
| Efficiency | Optimized for speed | Brute-force, page-by-page |
| Accountability | Traceable operators | Anonymous cloud instances |
The Five Layers of API Protection
Effective protection against AI scrapers requires a defense-in-depth approach. No single technique is sufficient; you need multiple layers working together.
Let's implement each layer using Apache APISIX.
Prerequisites
- Install Docker; the quickstart script uses it to create containerized etcd and APISIX instances.
- Install cURL to run the quickstart script and to send verification requests to APISIX.
Step 1: Get APISIX
APISIX can be easily installed and started with the quickstart script:
curl -sL https://run.api7.ai/apisix/quickstart | sh
You will see the following message once APISIX is ready:
✔ APISIX is ready!
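The Admin API calls in the rest of this post assume your Admin API key is available in the shell variable admin_key. Export it first, substituting the key from your own APISIX configuration (conf/config.yaml); the value below is only a placeholder:

# Placeholder only - replace with the admin key from your APISIX deployment's
# conf/config.yaml before running any of the Admin API calls that follow.
export admin_key="your-admin-api-key"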
Step 2: Layer 1 - IP-Based Filtering
First, let's block known problematic IP ranges. Azure and certain cloud providers are frequently mentioned as sources of aggressive scraping.
Create a route with IP restrictions:
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "protected-api", "uri": "/api/*", "upstream": { "type": "roundrobin", "nodes": { "your-backend:8080": 1 } }, "plugins": { "ip-restriction": { "blacklist": [ "20.0.0.0/8", "40.74.0.0/15", "52.224.0.0/11" ], "message": "Access denied. If you believe this is an error, please contact support." } } }'
Important: This is a blunt instrument. Use it carefully and consider your legitimate users who might be on cloud platforms.
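To confirm the route and its ip-restriction configuration were stored, you can read the route back from the Admin API (a sanity check rather than a full test, since verifying the block itself requires a client inside one of the blacklisted ranges):

# Fetch the stored route definition; the response should echo back the
# ip-restriction plugin with the blacklisted CIDR ranges.
curl "http://127.0.0.1:9180/apisix/admin/routes/protected-api" \
  -H "X-API-KEY: ${admin_key}"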
Step 3: Layer 2 - User-Agent Validation
While User-Agent strings can be spoofed, many scrapers don't bother. Let's block known AI crawler User-Agents:
curl "http://127.0.0.1:9180/apisix/admin/routes/protected-api" -X PATCH \ -H "X-API-KEY: your-secure-admin-key" \ -d '{ "plugins": { "ua-restriction": { "bypass_missing": false, "denylist": [ "GPTBot", "ChatGPT-User", "CCBot", "anthropic-ai", "Claude-Web", "Bytespider", "PetalBot", "Amazonbot", "meta-externalagent", "FacebookBot", "cohere-ai", "PerplexityBot", "Applebot-Extended" ], "message": "AI scrapers are not permitted. Please use our data dump at /downloads/data.tar.gz" } } }'
Notice the custom message—this implements the "coordination" solution discussed on Hacker News. We're telling scrapers where to find bulk data.
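You can verify the denylist by spoofing one of the blocked User-Agents yourself. The /api/status path below is just an illustrative endpoint behind the /api/* route:

# Pretend to be an AI crawler; APISIX should reply with 403 and the custom
# message pointing at the bulk download, without touching your backend.
curl -i "http://127.0.0.1:9080/api/status" -H "User-Agent: GPTBot/1.0"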
Step 4: Layer 3 - Intelligent Rate Limiting
This is where APISIX really shines. We'll implement a two-tier rate limiting strategy:
- Burst protection: Allow short bursts for legitimate users
- Sustained rate limiting: Prevent long-term abuse
curl "http://127.0.0.1:9180/apisix/admin/routes/protected-api" -X PATCH \ -H "X-API-KEY: ${admin_key}" \ -d '{ "plugins": { "limit-req": { "rate": 10, "burst": 20, "key_type": "var", "key": "remote_addr", "rejected_code": 429, "rejected_msg": "{\"error\": \"Rate limit exceeded. For bulk data access, please download our data dump at /downloads/data.tar.gz\", \"retry_after\": 60}" }, "limit-count": { "count": 1000, "time_window": 3600, "key_type": "var", "key": "remote_addr", "rejected_code": 429, "policy": "local" } } }'
What this does:
- limit-req: Allows 10 requests per second, with bursts of up to 20
- limit-count: Caps total requests at 1,000 per hour per IP
The custom error message includes a retry_after value and points to the bulk download—addressing the Hacker News discussion about whether programs read 429 response bodies.
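A rough way to see the limits in action is to fire a concurrent burst from a single client and count the status codes. As before, /api/status is only an illustrative path:

# Send 100 concurrent requests; requests beyond the 10 req/s rate and the
# burst allowance of 20 are rejected with 429 and the JSON error body.
seq 1 100 | xargs -P 20 -I{} curl -s -o /dev/null -w "%{http_code}\n" \
  "http://127.0.0.1:9080/api/status" | sort | uniq -c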
Step 5: Layer 4 - API Key Authentication
For sensitive endpoints, require authentication. This is exactly what MetaBrainz did with their /metadata/lookup endpoint.
First, create a consumer:
curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \ -H "X-API-KEY: ${admin_key}" \ -d '{ "username": "legitimate-user", "plugins": { "key-auth": { "key": "user-specific-api-key-here" } } }'
Then protect specific endpoints:
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "authenticated-api", "uri": "/api/metadata/*", "upstream": { "type": "roundrobin", "nodes": { "your-backend:8080": 1 } }, "plugins": { "key-auth": { "header": "Authorization", "query": "api_key" }, "limit-count": { "count": 10000, "time_window": 3600, "key_type": "var", "key": "consumer_name", "rejected_code": 429, "policy": "local" } } }'
Now authenticated users get higher rate limits (10,000/hour vs 1,000/hour for anonymous), and you can track usage per user.
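To check both paths, send one request without a key and one with the consumer's key in the Authorization header configured above. The /api/metadata/recording path is just an example under /api/metadata/*:

# No key: key-auth rejects the request with 401 before it reaches the backend.
curl -i "http://127.0.0.1:9080/api/metadata/recording"

# With the consumer's key: the request is proxied upstream and counted
# against the per-consumer 10,000/hour limit-count quota.
curl -i "http://127.0.0.1:9080/api/metadata/recording" \
  -H "Authorization: user-specific-api-key-here"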
Step 6: Layer 5 - Honeypot Detection
This is the clever part. Create a hidden endpoint that legitimate users would never access, but scrapers following links blindly will hit:
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "honeypot", "uri": "/api/internal/debug-data", "plugins": { "serverless-pre-function": { "phase": "access", "functions": [ "return function(conf, ctx) local core = require(\"apisix.core\"); local ip = core.request.get_remote_client_ip(ctx); core.log.warn(\"HONEYPOT HIT from IP: \", ip); return 403, {error = \"Access denied\"} end" ] } } }'
Add a hidden link to this endpoint in your HTML (invisible to users but visible to scrapers):
<a href="/api/internal/debug-data" style="display:none">Debug Data</a>
Any IP that hits this endpoint is almost certainly a scraper. You can then add it to your blocklist automatically.
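For example, you could feed honeypot hits back into the ip-restriction blacklist from Layer 1 via the Admin API. A minimal sketch, where 203.0.113.42 stands in for an IP pulled from the honeypot log:

# Append the offending IP to the blacklist; PATCH replaces the plugin's
# blacklist array wholesale, so include the original CIDR ranges as well.
curl "http://127.0.0.1:9180/apisix/admin/routes/protected-api" -X PATCH \
  -H "X-API-KEY: ${admin_key}" \
  -d '{
    "plugins": {
      "ip-restriction": {
        "blacklist": [
          "20.0.0.0/8",
          "40.74.0.0/15",
          "52.224.0.0/11",
          "203.0.113.42"
        ]
      }
    }
  }'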
Step 7: Implement Smart 429 Responses
Following the Hacker News discussion about retry-after headers [2], let's configure proper 429 responses:
curl "http://127.0.0.1:9180/apisix/admin/routes/protected-api" -X PATCH \ -H "X-API-KEY: ${admin_key}" \ -d '{ "plugins": { "response-rewrite": { "headers": { "set": { "Retry-After": "60", "X-RateLimit-Limit": "1000", "X-RateLimit-Remaining": "$limit_count_remaining", "X-Bulk-Download": "/downloads/data.tar.gz" } } } } }'
The custom X-Bulk-Download header implements the "coordination" solution—telling well-behaved bots where to find bulk data.
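You can confirm the headers are attached by inspecting a response on the protected route (again, /api/status is only an illustrative path):

# Dump response headers only and filter for the rate-limit related ones.
curl -s -D - -o /dev/null "http://127.0.0.1:9080/api/status" \
  | grep -i -E "retry-after|x-ratelimit|x-bulk-download"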
Monitoring and Observability
Protection is only half the battle. You need to know what's happening. Enable Prometheus metrics (a setup sketch follows the list below) to track:
- Request rates by IP
- 429 responses over time
- Blocked User-Agents
- Honeypot hits
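A minimal sketch for turning the metrics on, assuming you want them for every route, is a global rule that enables the prometheus plugin:

# Enable the prometheus plugin globally so every route, including the
# honeypot, is reflected in the exported metrics.
curl "http://127.0.0.1:9180/apisix/admin/global_rules/1" -X PUT \
  -H "X-API-KEY: ${admin_key}" \
  -d '{
    "plugins": {
      "prometheus": {}
    }
  }'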
Access metrics at http://127.0.0.1:9091/apisix/prometheus/metrics:
# HELP apisix_http_status HTTP status codes
# TYPE apisix_http_status counter
apisix_http_status{code="429",route="protected-api"} 15234
apisix_http_status{code="200",route="protected-api"} 89421
# HELP apisix_bandwidth Total bandwidth
# TYPE apisix_bandwidth counter
apisix_bandwidth{type="ingress",route="protected-api"} 1234567890
The Coordination Solution: robots.txt 2.0
One of the most insightful points from the Hacker News discussion was the need for a new standard—a way for websites to say "here's bulk data, use this instead of scraping".
While we wait for that standard, you can implement it yourself using APISIX:
- Create a well-known endpoint: /well-known/bulk-data.json
- Return structured information about available data dumps
- Reference it in your robots.txt
{ "bulk_data_available": true, "downloads": [ { "name": "Full Database Dump", "url": "/downloads/full-dump-2026-01-14.tar.gz", "format": "tar.gz", "size_bytes": 1234567890, "updated": "2026-01-14T00:00:00Z", "checksum_sha256": "abc123..." } ], "api_documentation": "/docs/api", "rate_limits": { "anonymous": "1000 requests/hour", "authenticated": "10000 requests/hour" }, "contact": "api-support@example.com" }
Add to your robots.txt:
User-agent: *
Disallow: /api/internal/

# Bulk data available - please use instead of scraping
# See: /well-known/bulk-data.json
Results: Before and After
Here's what you can expect after implementing these protections:
| Metric | Before | After |
|---|---|---|
| Scraper traffic | 60% of requests | < 5% of requests |
| Server load | Constant spikes | Stable, predictable |
| Infrastructure cost | Unpredictable | Reduced by 40-60% |
| Legitimate user experience | Degraded during attacks | Consistent |
| Visibility into abuse | None | Complete |
Conclusion: The New Reality of API Protection
The era of trusting robots.txt is over. AI companies have made it clear that they'll scrape first and ask questions never. As one Hacker News commenter observed:
"We can't have nice things because the powers that be decided that adtech money was worth far more than efficiency, interoperability, and things like user privacy and autonomy."
But you're not powerless. An API Gateway like Apache APISIX gives you the tools to fight back:
- Multi-layer defense that doesn't rely on any single technique.
- Intelligent rate limiting that distinguishes humans from bots.
- Observability to understand what's hitting your APIs.
- Coordination mechanisms to guide well-behaved bots to bulk data.
The small sites that are shutting down or going behind paywalls don't have to. With the right infrastructure, you can keep your APIs open to legitimate users while blocking the scrapers that would otherwise force you offline.