Why do web scrapers get blocked even with proxies?

Proxies only fix one of three signals. If your requests still come too fast, or your TLS and browser fingerprint scream 'automation', a clean IP won't save you. Cheap datacenter proxies are also pre-flagged on many sites, so they can make things worse, not better.

How many requests per second is safe before a 429?

There's no universal number. The threshold depends on the target site, the quality of your proxy pool, and whether your session looks human. The same rate that runs fine through residential IPs with real cookies will trip a 429 instantly through flagged datacenter IPs.

What is a 429 status code in scraping?

HTTP 429 means 'Too Many Requests': the server is rate-limiting you. It's the politest block: the site is telling you to slow down rather than banning you outright. Back off, add delays, and rotate IPs before it escalates to a 403 or a CAPTCHA wall.

What is TLS or JA3 fingerprinting?

When your client opens an HTTPS connection, it sends its supported ciphers and extensions in a specific order. That order forms a fingerprint (a JA3 signature) that's different for Chrome versus a Python script. Servers use it to spot automation regardless of your IP or headers.

Does using a headless browser stop me from getting blocked?

Not by itself. Headless browsers fix JavaScript rendering and some fingerprint signals, but default automation flags (like navigator.webdriver) still leak, and overriding the User-Agent can leave it disagreeing with the browser's real TLS handshake. They're heavier and slower, too. They help, but they're not a magic bypass.

Is it better to build a scraper or use a scraping API?

Build it if scraping is your core product and you want full control. Use an API when the data is a means to an end and you'd rather not babysit proxies, fingerprints, and CAPTCHAs. A managed service like our Google Search API absorbs the anti-bot arms race so you don't have to.

Is web scraping legal?

Scraping publicly available data is generally permitted in many jurisdictions, but it depends on the site's terms, the data type, and your location. Courts have ruled differently across cases. Always check the target's terms of service and applicable law before running at scale.

Why Web Scrapers Get Blocked (and How to Fix It)

Web scrapers get blocked when a website’s anti-bot systems detect automated traffic and refuse it, returning an HTTP 403 (Forbidden), an HTTP 429 (Too Many Requests), or a CAPTCHA challenge instead of the requested page. A scraper that ran fine one day can stop working the next, with no change to its code, once the target site flags it.

The reasons why web scrapers get blocked rarely come down to a single clever detection. Almost every block traces back to three signals an automated client tends to leak: the rate of its requests, the quality of the proxies it routes through, and how consistently its fingerprint matches a real browser.

In short: Web scrapers get blocked because of three signals: request rate, proxy quality, and browser fingerprint. There is no universal requests-per-second threshold that is safe. The same rate runs clean through good residential proxies and trips an instant block through flagged datacenter IPs. Fix the weakest of the three signals, not just the speed.

Signals that get you blocked

429: Too Many Requests

403: Forbidden (you're flagged)

JA3: your TLS fingerprint

The sections below cover each signal in turn: how it triggers a block, why no single request rate is universally safe, and the common techniques used to reduce detection.

The three reasons why web scrapers get blocked

Strip away the jargon and almost every block traces back to one of three causes. A scraper gets flagged because it is:

Making too many requests in a short time

Volume and speed that no human could produce. This trips rate limits and earns you a 429.

Using low-quality proxies

Datacenter IPs that are already on blocklists, shared by thousands of other scrapers.

Not covering your fingerprint properly

Headers, TLS handshake, and browser signals that don't match a real user.

That’s the whole list: rate, proxies, fingerprint. These three account for the overwhelming majority of blocks seen in production.

The mistake most people make is fixing one and ignoring the other two. You buy proxies, the blocks keep coming, and you blame the proxies. The proxies were fine. Your fingerprint gave you away.
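If rate does turn out to be your weak signal, the usual first step is a randomized gap between requests. Here's a minimal sketch of that idea, not a production rate limiter. The Pacer class and its parameters are my own illustrative names, and the 2 to 5 second range in the usage note is only an example, not a safe number for any particular site.

```python
import random
import time


class Pacer:
    """Enforce a randomized minimum gap between consecutive requests."""

    def __init__(self, min_delay, max_delay, clock=time.monotonic, sleep=time.sleep):
        if not 0 <= min_delay <= max_delay:
            raise ValueError("need 0 <= min_delay <= max_delay")
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock          # injectable so the pacer can be tested
        self._sleep = sleep
        self._next_allowed = 0.0     # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        pause = max(0.0, self._next_allowed - now)
        if pause:
            self._sleep(pause)
        # Schedule the following request a random gap after this one.
        gap = random.uniform(self.min_delay, self.max_delay)
        self._next_allowed = max(now, self._next_allowed) + gap
        return pause
```

Create one pacer per target site, for example Pacer(2, 5), and call wait() right before each request. The first call returns immediately; later calls sleep only for whatever part of the gap hasn't already elapsed.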

Why there’s no magic requests-per-second number

People always want one number. “How many requests per second before I get blocked?”

I get why. It would make life easy. But I have to be honest: that number doesn’t exist, and anyone who gives you one is selling something.

The rate that’s safe depends on the other two signals. The same 10 requests per second that runs clean through residential proxies with real cookies will trip an instant 429 through flagged datacenter IPs with a Python-shaped fingerprint.

Bottom line:

Your safe request rate isn't a fixed threshold. It depends on three variables: proxy quality, session quality (cookies and headers), and fingerprint consistency. Improve the weakest one and your safe rate goes up.

So when your scraper dies, don’t ask “was I too fast?” Ask “which of the three was weakest?” Usually it’s not speed. Speed is just the thing that finally tipped the balance.

Bad proxies are the most common cause

If I had to bet on why your scraper is blocked right now, I’d bet on the proxies. It’s the cause people underestimate the most.

Not all proxies are equal. The cheap ones are cheap for a reason.

Datacenter proxies

Cheap, fast, and easy to detect. Their IP ranges are known and often pre-flagged. Shared pools mean someone else already burned the IP you just got.

Residential proxies

Real ISP-assigned IPs from actual devices. Far harder to flag because blocking them risks blocking real users. More expensive, but they survive.

There’s a second proxy trap nobody warns you about: geography. Many sites serve different content, or block outright, based on where the IP says you are.

Scrape a US marketplace from a German IP and you may get a different page, wrong prices, or a hard block. The fix is country-pinning: routing each request through an IP inside the target country. You can do this with country-targeted residential proxies, or with a managed service that handles it for you (it’s how our Amazon data API routes marketplace requests).

The same applies to local listings, search results, and anything that varies by region. Route through the wrong country and you get data no real user there would ever see.
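Country-pinning can be sketched in a few lines. This assumes you already have one proxy gateway per country; the hostnames and credentials below are placeholders I made up, since every provider has its own way of selecting a country. The session argument is assumed to be a requests.Session, and proxies= and timeout= are its standard keyword arguments.

```python
# Hypothetical gateways: real providers each use their own host names and
# country-selection syntax, so treat these URLs as placeholders.
PROXY_BY_COUNTRY = {
    "us": "http://user:pass@us.proxy.example:8000",
    "de": "http://user:pass@de.proxy.example:8000",
}


def proxies_for(country):
    """Return a requests-style proxies mapping for one country code."""
    try:
        gateway = PROXY_BY_COUNTRY[country.lower()]
    except KeyError:
        raise ValueError(f"no proxy configured for country {country!r}") from None
    # Send both plain and TLS traffic through the same in-country gateway.
    return {"http": gateway, "https": gateway}


def fetch_from(session, country, url):
    """GET a URL through a proxy located in the given country.

    `session` is assumed to be a requests.Session (or anything with a
    compatible .get method).
    """
    return session.get(url, proxies=proxies_for(country), timeout=10)
```

Failing loudly on an unconfigured country is deliberate. Quietly falling back to whatever IP is available is how you end up with the wrong prices in your dataset.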

Your fingerprint gives you away

Here’s the signal that catches people who did everything else right. You rotated good residential proxies, you slowed down, and you still get blocked. Why?

Because your client looks nothing like a browser at the protocol level.

When any client opens an HTTPS connection, it sends its supported ciphers and extensions in a particular order. Chrome sends them one way. A Python requests script sends them another.

That order is a fingerprint, often expressed as a JA3 signature, and a server can read it during the TLS handshake, before any part of your HTTP request (headers included) arrives.

The trap:

You can set a perfect Chrome User-Agent header and still get blocked, because your TLS handshake says "Python." The header and the handshake disagree, and that mismatch is the tell.
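To make the fingerprint idea concrete, here's a rough sketch of how a JA3 signature is built, as I understand the published format: five fields from the ClientHello (TLS version, ciphers, extensions, elliptic curves, point formats) written as decimal numbers, dashes inside a field, commas between fields, then an MD5 hash of the result. The function names are mine, and the numbers in the usage note are illustrative, not a capture from a real browser.

```python
import hashlib


def is_grease(value):
    """True for reserved GREASE values (0x0A0A, 0x1A1A, ... 0xFAFA), which JA3 skips."""
    return (value >> 8) == (value & 0xFF) and (value & 0x0F) == 0x0A


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Join the five ClientHello fields in JA3 order, preserving the client's ordering."""
    def field(values):
        return "-".join(str(v) for v in values if not is_grease(v))

    return ",".join([
        str(version),
        field(ciphers),
        field(extensions),
        field(curves),
        field(point_formats),
    ])


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string: the 32-character signature."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()
```

Calling ja3_hash(771, [4865, 4866], [0, 10], [29, 23], [0]) and then swapping the two cipher values gives a different hash. Nothing about the client's capabilities changed, only the order, and that's enough to tell two TLS stacks apart.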

Fingerprinting goes beyond TLS. It includes your header order, whether you send the headers a real browser sends, HTTP/2 frame settings, and JavaScript signals like navigator.webdriver. Each one is a chance to look wrong.
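One cheap self-check for the header part: if your User-Agent claims Chrome, do you also send the headers Chrome sends? The sketch below uses an illustrative subset of what recent desktop Chrome includes on a top-level page navigation. The exact set varies by version and request type, so treat the list (and the function name) as assumptions, not a specification.

```python
# Illustrative subset of headers recent desktop Chrome sends on a page
# navigation. An assumption for this sketch, not an exhaustive list.
EXPECTED_CHROME_HEADERS = (
    "accept",
    "accept-encoding",
    "accept-language",
    "sec-ch-ua",
    "sec-ch-ua-mobile",
    "sec-ch-ua-platform",
    "sec-fetch-dest",
    "sec-fetch-mode",
    "sec-fetch-site",
    "upgrade-insecure-requests",
    "user-agent",
)


def missing_chrome_headers(headers):
    """List expected Chrome headers absent from a header mapping.

    Returns an empty list when the User-Agent doesn't claim to be Chrome,
    because then there is no Chrome claim to contradict.
    """
    present = {name.lower(): value for name, value in headers.items()}
    if "chrome/" not in present.get("user-agent", "").lower():
        return []
    return [name for name in EXPECTED_CHROME_HEADERS if name not in present]
```

An empty result doesn't mean you look like Chrome. Header order and values matter too, and this only catches the most obvious gap.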

This is the part of the arms race that never ends. Sites add a new signal, scrapers adapt, repeat. It’s genuinely exhausting to keep up with, and it’s the main reason teams eventually stop rolling their own.

Blocked vs. configured: a quick Python example

Here’s the naive version almost everyone starts with. It works on a friendly site and gets a 429 or 403 on a defended one.

import requests

# The version that gets blocked
r = requests.get("https://example.com/search?q=laptops")
print(r.status_code)  # 429 or 403 before long

No real headers, no session, no pacing. To a defended site this is a flashing sign that says “bot.”

A more careful version reuses a session, sends browser-like headers, and paces itself. It survives longer, though it still won’t beat TLS fingerprinting on the toughest targets.

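import random
import time

import requests

# Reuse one session so cookies and connections persist across requests
s = requests.Session()
s.headers.update({
    # Current Chrome reports its version as <major>.0.0.0
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
})

for q in ["laptops", "monitors", "keyboards"]:
    # params= URL-encodes the query; timeout keeps a stalled request from hanging forever
    r = s.get("https://example.com/search", params={"q": q}, timeout=10)
    print(q, r.status_code)
    time.sleep(random.uniform(2, 5))  # randomized pause between requests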

It’s better. But notice what’s missing: a clean residential IP and a matching TLS fingerprint. Those two are the hard part, and they’re exactly what a managed search API takes off your plate.
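One more thing the careful version still doesn't do is react when a 429 actually arrives. Here's a hedged sketch of backing off: honor the server's Retry-After header when it gives a number of seconds, otherwise wait an exponentially growing, randomized delay. The function names, the 2-second base, and the 60-second cap are my own choices for illustration, and session is assumed to be a requests.Session.

```python
import random
import time


def retry_delay(response, attempt, base=2.0, cap=60.0):
    """Seconds to wait after a 429 response.

    Uses Retry-After when it is a plain number of seconds. (The header can
    also be an HTTP date, which this sketch does not parse.) Otherwise falls
    back to exponential backoff with full jitter.
    """
    header = response.headers.get("Retry-After", "").strip()
    if header.isdigit():
        return min(float(header), cap)  # cap so one response can't stall us for hours
    return random.uniform(0, min(cap, base * 2 ** attempt))


def get_with_backoff(session, url, max_attempts=5, sleep=time.sleep):
    """GET a URL, backing off and retrying while the server answers 429."""
    response = None
    for attempt in range(max_attempts):
        response = session.get(url, timeout=10)
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(retry_delay(response, attempt))
    return response  # still 429 after max_attempts; the caller decides what next
```

If this returns a 429 after all attempts, retrying harder is rarely the answer. That's usually the cue to look at the other two signals.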

When to stop building and use an API

Building your own scraper is a fine choice when scraping is your core product. You want control, you have the time, and the arms race is your job.

But if the data is a means to an end, the math changes fast. You didn’t set out to maintain a proxy pool and reverse-engineer JA3 signatures. You wanted the data.

That’s when a managed scraping API makes sense. A service like our Google Search API handles all three signals for you: clean residential routing, human-like pacing, and a fingerprint that matches a real browser. You send a query, you get structured JSON back.

The same logic covers the other tough targets. CAPTCHA-heavy sites like Amazon (see our breakdown of Amazon’s CAPTCHA systems), data behind logins, and job boards all put up the same anti-bot wall, and a managed endpoint handles it.

Build it yourself

Scraping is your core product, you want full control, and maintaining proxies and fingerprints is time you're happy to spend.

Use a scraping API

The data is a means to an end. You'd rather ship features than babysit the anti-bot arms race. Let a managed Google search results API handle the blocking.

If you want the deeper benchmark on which managed option is actually cheapest, I covered that in the best web scraping API comparison. And if you’re set on rolling your own, our Python web scraping guide is the honest starting point.

So why did your scraper die on Thursday?

Go back to the three signals. It wasn’t bad luck and the site didn’t single you out. One of the three tipped over.

Most likely your proxies got flagged, or the site added a fingerprint check your requests script couldn’t match. Speed was just the trigger, not the cause.

Fix the weakest of the three and your scraper comes back to life. Or hand all three to an API that does nothing but fight this battle, and go build the thing you actually wanted to build.

The one thing to remember:

Speed is the trigger, not the cause. Rate, proxies, and fingerprint are the three signals that decide whether you get through. Find the weakest one before you blame the speed.

What are you scraping when it breaks? I’m always curious which of the three gets people most.

P.S. If you only fix one thing today, fix your proxies. It’s the cause we see behind more blocks than the other two combined.

Oriol.
