How to Fight Against AI Scrapers with an API Gateway
January 14, 2026
Last month, MetaBrainz—the nonprofit behind MusicBrainz and ListenBrainz—made a painful announcement: they were locking down their public APIs. The reason? AI scrapers had been hammering their servers so aggressively that legitimate users couldn't access the service. Rather than downloading the freely available data dumps, these bots were crawling the entire database one page at a time, a process that would take "hundreds of years to complete" and was "utterly pointless."
MetaBrainz's story is far from unique. Across the web, independent developers and small businesses are reporting eerily similar battles: a personal website maintained for over two decades, shut down after being overwhelmed; a passion project forced behind a paywall when its modest budget could no longer sustain the robotic traffic; companies watching their operational costs skyrocket while real users face frustrating timeouts and errors.
The common thread? Traditional defenses don't work anymore. Robots.txt is ignored. User-Agent filtering is trivially bypassed. IP blocking is a game of whack-a-mole against cloud providers with infinite IP ranges.
This post explores why AI scrapers are different from traditional bots, what actually works to stop them, and how to implement a multi-layered defense using Apache APISIX as your API Gateway.
Why AI Scrapers Are Different
Traditional web scrapers were typically operated by individuals or small companies with limited resources. They respected robots.txt because getting blocked meant starting over. They had identifiable patterns because they were optimized for specific targets.
AI scrapers operate under a fundamentally different model:
| Characteristic | Traditional Scrapers | AI Scrapers |
|---|---|---|
| Scale | Targeted, specific sites | Everything, everywhere |
| Resources | Limited infrastructure | Backed by billion-dollar companies |
| Robots.txt | Generally respected | Frequently ignored |
| Goal | Extract specific data | Consume all available content |
| Efficiency | Optimized for speed | Brute-force, page-by-page |
| Accountability | Traceable operators | Anonymous cloud instances |
The Five Layers of API Protection
Effective protection against AI scrapers requires a defense-in-depth approach. No single technique is sufficient; you need multiple layers working together.
Let's implement each layer using Apache APISIX.
Prerequisites
- Install Docker; the quickstart script uses it to create containerized etcd and APISIX instances.
- Install cURL to run the quickstart script and to send verification requests to APISIX.
Step 1: Get APISIX
APISIX can be easily installed and started with the quickstart script:
curl -sL https://run.api7.ai/apisix/quickstart | sh
You will see the following message once APISIX is ready:
✔ APISIX is ready!
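The Admin API calls in the rest of this post assume your Admin API key is available in the shell variable admin_key. Export it first, substituting the key from your own APISIX configuration (conf/config.yaml); the value below is only a placeholder:

# Placeholder only - replace with the admin key from your APISIX deployment's
# conf/config.yaml before running any of the Admin API calls that follow.
export admin_key="your-admin-api-key"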
Step 2: Layer 1 - IP-Based Filtering
First, let's block known problematic IP ranges. Azure and certain cloud providers are frequently mentioned as sources of aggressive scraping.
Create a route with IP restrictions:
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "protected-api", "uri": "/api/*", "upstream": { "type": "roundrobin", "nodes": { "your-backend:8080": 1 } }, "plugins": { "ip-restriction": { "blacklist": [ "20.0.0.0/8", "40.74.0.0/15", "52.224.0.0/11" ], "message": "Access denied. If you believe this is an error, please contact support." } } }'
Important: This is a blunt instrument. Use it carefully and consider your legitimate users who might be on cloud platforms.
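To confirm the route and its ip-restriction configuration were stored, you can read the route back from the Admin API (a sanity check rather than a full test, since verifying the block itself requires a client inside one of the blacklisted ranges):

# Fetch the stored route definition; the response should echo back the
# ip-restriction plugin with the blacklisted CIDR ranges.
curl "http://127.0.0.1:9180/apisix/admin/routes/protected-api" \
  -H "X-API-KEY: ${admin_key}"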
Step 3: Layer 2 - User-Agent Validation
While User-Agent strings can be spoofed, many scrapers don't bother. Let's block known AI crawler User-Agents:
curl "http://127.0.0.1:9180/apisix/admin/routes/protected-api" -X PATCH \ -H "X-API-KEY: your-secure-admin-key" \ -d '{ "plugins": { "ua-restriction": { "bypass_missing": false, "denylist": [ "GPTBot", "ChatGPT-User", "CCBot", "anthropic-ai", "Claude-Web", "Bytespider", "PetalBot", "Amazonbot", "meta-externalagent", "FacebookBot", "cohere-ai", "PerplexityBot", "Applebot-Extended" ], "message": "AI scrapers are not permitted. Please use our data dump at /downloads/data.tar.gz" } } }'
Notice the custom message—this implements the "coordination" solution discussed on Hacker News. We're telling scrapers where to find bulk data.
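You can verify the denylist by spoofing one of the blocked User-Agents yourself. The /api/status path below is just an illustrative endpoint behind the /api/* route:

# Pretend to be an AI crawler; APISIX should reply with 403 and the custom
# message pointing at the bulk download, without touching your backend.
curl -i "http://127.0.0.1:9080/api/status" -H "User-Agent: GPTBot/1.0"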
Step 4: Layer 3 - Intelligent Rate Limiting
This is where APISIX really shines. We'll implement a two-tier rate limiting strategy:
- Burst protection: Allow short bursts for legitimate users
- Sustained rate limiting: Prevent long-term abuse
curl "http://127.0.0.1:9180/apisix/admin/routes/protected-api" -X PATCH \ -H "X-API-KEY: ${admin_key}" \ -d '{ "plugins": { "limit-req": { "rate": 10, "burst": 20, "key_type": "var", "key": "remote_addr", "rejected_code": 429, "rejected_msg": "{\"error\": \"Rate limit exceeded. For bulk data access, please download our data dump at /downloads/data.tar.gz\", \"retry_after\": 60}" }, "limit-count": { "count": 1000, "time_window": 3600, "key_type": "var", "key": "remote_addr", "rejected_code": 429, "policy": "local" } } }'
What this does:
- limit-req: Allows 10 requests per second, with bursts of up to 20
- limit-count: Caps total requests at 1,000 per hour per IP
The custom error message includes a retry_after value and points to the bulk download—addressing the Hacker News discussion about whether programs read 429 response bodies.
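A rough way to see the limits in action is to fire a concurrent burst from a single client and count the status codes. As before, /api/status is only an illustrative path:

# Send 100 concurrent requests; requests beyond the 10 req/s rate and the
# burst allowance of 20 are rejected with 429 and the JSON error body.
seq 1 100 | xargs -P 20 -I{} curl -s -o /dev/null -w "%{http_code}\n" \
  "http://127.0.0.1:9080/api/status" | sort | uniq -c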
Step 5: Layer 4 - API Key Authentication
For sensitive endpoints, require authentication. This is exactly what MetaBrainz did with their /metadata/lookup endpoint.
First, create a consumer:
curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \ -H "X-API-KEY: ${admin_key}" \ -d '{ "username": "legitimate-user", "plugins": { "key-auth": { "key": "user-specific-api-key-here" } } }'
Then protect specific endpoints:
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "authenticated-api", "uri": "/api/metadata/*", "upstream": { "type": "roundrobin", "nodes": { "your-backend:8080": 1 } }, "plugins": { "key-auth": { "header": "Authorization", "query": "api_key" }, "limit-count": { "count": 10000, "time_window": 3600, "key_type": "var", "key": "consumer_name", "rejected_code": 429, "policy": "local" } } }'
Now authenticated users get higher rate limits (10,000/hour vs 1,000/hour for anonymous), and you can track usage per user.
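To check both paths, send one request without a key and one with the consumer's key in the Authorization header configured above. The /api/metadata/recording path is just an example under /api/metadata/*:

# No key: key-auth rejects the request with 401 before it reaches the backend.
curl -i "http://127.0.0.1:9080/api/metadata/recording"

# With the consumer's key: the request is proxied upstream and counted
# against the per-consumer 10,000/hour limit-count quota.
curl -i "http://127.0.0.1:9080/api/metadata/recording" \
  -H "Authorization: user-specific-api-key-here"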
Step 6: Layer 5 - Honeypot Detection
This is the clever part. Create a hidden endpoint that legitimate users would never access, but scrapers following links blindly will hit:
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "honeypot", "uri": "/api/internal/debug-data", "plugins": { "serverless-pre-function": { "phase": "access", "functions": [ "return function(conf, ctx) local core = require(\"apisix.core\"); local ip = core.request.get_remote_client_ip(ctx); core.log.warn(\"HONEYPOT HIT from IP: \", ip); return 403, {error = \"Access denied\"} end" ] } } }'
Add a hidden link to this endpoint in your HTML (invisible to users but visible to scrapers):
<a href="/api/internal/debug-data" style="display:none">Debug Data</a>
Any IP that hits this endpoint is almost certainly a scraper. You can then add it to your blocklist automatically.
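For example, you could feed honeypot hits back into the ip-restriction blacklist from Layer 1 via the Admin API. A minimal sketch, where 203.0.113.42 stands in for an IP pulled from the honeypot log:

# Append the offending IP to the blacklist; PATCH replaces the plugin's
# blacklist array wholesale, so include the original CIDR ranges as well.
curl "http://127.0.0.1:9180/apisix/admin/routes/protected-api" -X PATCH \
  -H "X-API-KEY: ${admin_key}" \
  -d '{
    "plugins": {
      "ip-restriction": {
        "blacklist": [
          "20.0.0.0/8",
          "40.74.0.0/15",
          "52.224.0.0/11",
          "203.0.113.42"
        ]
      }
    }
  }'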
Step 7: Implement Smart 429 Responses
Following the Hacker News discussion about retry-after headers [2], let's configure proper 429 responses:
curl "http://127.0.0.1:9180/apisix/admin/routes/protected-api" -X PATCH \ -H "X-API-KEY: ${admin_key}" \ -d '{ "plugins": { "response-rewrite": { "headers": { "set": { "Retry-After": "60", "X-RateLimit-Limit": "1000", "X-RateLimit-Remaining": "$limit_count_remaining", "X-Bulk-Download": "/downloads/data.tar.gz" } } } } }'
The custom X-Bulk-Download header implements the "coordination" solution—telling well-behaved bots where to find bulk data.
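You can confirm the headers are attached by inspecting a response on the protected route (again, /api/status is only an illustrative path):

# Dump response headers only and filter for the rate-limit related ones.
curl -s -D - -o /dev/null "http://127.0.0.1:9080/api/status" \
  | grep -i -E "retry-after|x-ratelimit|x-bulk-download"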
Monitoring and Observability
Protection is only half the battle. You need to know what's happening. Enable Prometheus metrics (a setup sketch follows the list below) to track:
- Request rates by IP
- 429 responses over time
- Blocked User-Agents
- Honeypot hits
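A minimal sketch for turning the metrics on, assuming you want them for every route, is a global rule that enables the prometheus plugin:

# Enable the prometheus plugin globally so every route, including the
# honeypot, is reflected in the exported metrics.
curl "http://127.0.0.1:9180/apisix/admin/global_rules/1" -X PUT \
  -H "X-API-KEY: ${admin_key}" \
  -d '{
    "plugins": {
      "prometheus": {}
    }
  }'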
Access metrics at http://127.0.0.1:9091/apisix/prometheus/metrics:
# HELP apisix_http_status HTTP status codes
# TYPE apisix_http_status counter
apisix_http_status{code="429",route="protected-api"} 15234
apisix_http_status{code="200",route="protected-api"} 89421
# HELP apisix_bandwidth Total bandwidth
# TYPE apisix_bandwidth counter
apisix_bandwidth{type="ingress",route="protected-api"} 1234567890
The Coordination Solution: robots.txt 2.0
One of the most insightful points from the Hacker News discussion was the need for a new standard—a way for websites to say "here's bulk data, use this instead of scraping".
While we wait for that standard, you can implement it yourself using APISIX:
- Create a well-known endpoint: /well-known/bulk-data.json
- Return structured information about available data dumps
- Reference it in your robots.txt
{ "bulk_data_available": true, "downloads": [ { "name": "Full Database Dump", "url": "/downloads/full-dump-2026-01-14.tar.gz", "format": "tar.gz", "size_bytes": 1234567890, "updated": "2026-01-14T00:00:00Z", "checksum_sha256": "abc123..." } ], "api_documentation": "/docs/api", "rate_limits": { "anonymous": "1000 requests/hour", "authenticated": "10000 requests/hour" }, "contact": "api-support@example.com" }
Add to your robots.txt:
User-agent: *
Disallow: /api/internal/

# Bulk data available - please use instead of scraping
# See: /well-known/bulk-data.json
Results: Before and After
Here's what you can expect after implementing these protections:
| Metric | Before | After |
|---|---|---|
| Scraper traffic | 60% of requests | < 5% of requests |
| Server load | Constant spikes | Stable, predictable |
| Infrastructure cost | Unpredictable | Reduced by 40-60% |
| Legitimate user experience | Degraded during attacks | Consistent |
| Visibility into abuse | None | Complete |
Conclusion: The New Reality of API Protection
The era of trusting robots.txt is over. AI companies have made it clear that they'll scrape first and ask questions never. As one Hacker News commenter observed:
"We can't have nice things because the powers that be decided that adtech money was worth far more than efficiency, interoperability, and things like user privacy and autonomy."
But you're not powerless. An API Gateway like Apache APISIX gives you the tools to fight back:
- Multi-layer defense that doesn't rely on any single technique.
- Intelligent rate limiting that distinguishes humans from bots.
- Observability to understand what's hitting your APIs.
- Coordination mechanisms to guide well-behaved bots to bulk data.
The small sites that are shutting down or going behind paywalls don't have to. With the right infrastructure, you can keep your APIs open to legitimate users while blocking the scrapers that would otherwise force you offline.