Three support tickets in one week, same root cause. An Amazon seller’s custom repricing tool kept showing wrong prices for the US marketplace. He’d built it himself in Python – Requests + BeautifulSoup, the classic combo. Solid code, actually.
The problem? His scraper was running from a German IP, so Amazon was serving him German-localized results instead of real US prices. His “competitive repricing” was based on data from the wrong country.
11 tools tested · 6 categories covered · Free: all open-source · 2026: all actively maintained
That’s the kind of bug you don’t find on Stack Overflow. And it made me think: picking the right Python scraping tool is only half the problem. Knowing when your tool is lying to you is the other half.
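One cheap tripwire for this kind of silent failure is to check that every scraped price is formatted the way the target marketplace formats prices before it reaches any repricing logic. The sketch below is only an illustration: the function names and the regular expression are assumptions, not part of the seller's tool.

```python
import re

# Assumed US price shape: "$", digits with optional thousands commas,
# and an optional two-digit cents part after a dot.
US_PRICE = re.compile(r"^\$\s?(\d{1,3}(,\d{3})+|\d+)(\.\d{2})?$")


def looks_like_us_price(raw: str) -> bool:
    """Return True if the raw string is formatted like a US dollar price."""
    return US_PRICE.match(raw.strip()) is not None


def reject_foreign_prices(prices: list[str]) -> list[str]:
    """Return the prices unchanged, or raise if any is not US-formatted."""
    suspicious = [p for p in prices if not looks_like_us_price(p)]
    if suspicious:
        raise ValueError(f"Non-US price formats, check request geo: {suspicious}")
    return prices
```

A German-localized result such as `19,99 €` fails the check immediately. A dollar sign alone does not prove a US storefront (other countries use it too), so treat this as an early warning rather than proof.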
I’ve been running web scraping APIs that process millions of requests monthly for 3 years now. I have opinions about every tool in this space. Some of them are unpopular.
Requests is overrated for anything beyond hobby projects. Selenium should probably be retired. And the most underrated tool in this list is one most articles don’t even mention.
TL;DR: 11 Python scraping tools, organized by what they actually do – HTTP clients, parsers, browser automation, frameworks, stealth tools, and AI-powered crawlers. I tell you which one I’d pick for each job, and when you should skip scraping entirely and use an API instead.
I wrote a separate Python web scraping tutorial that walks through building scrapers from scratch. This article is different – it’s the tool comparison that post promised. Which library for which job, and why.
Quick comparison: all 11 tools at a glance
Before we get into the details, here’s the cheat sheet. Bookmark this – you’ll come back to it.
| Tool | Category | Best for | JS support | Difficulty |
|---|---|---|---|---|
| Requests | HTTP client | Quick scripts, APIs | ❌ | Easy |
| HTTPX | HTTP client | Async + HTTP/2 scraping | ❌ | Easy |
| curl_cffi | HTTP client | Anti-detection requests | ❌ | Medium |
| BeautifulSoup | HTML parser | Beginners, quick parsing | ❌ | Easy |
| Parsel / lxml | HTML parser | Fast XPath/CSS parsing | ❌ | Medium |
| Selenium | Browser automation | Legacy projects, testing | ✅ | Medium |
| Playwright | Browser automation | Modern dynamic sites | ✅ | Medium |
| Scrapy | Framework | Large-scale crawling | ❌ (plugin) | Hard |
| SeleniumBase | Stealth | Anti-bot evasion | ✅ | Medium |
| Crawl4AI | AI-powered | LLM-ready data extraction | ✅ | Easy |
| 2Captcha | Anti-captcha | CAPTCHA solving | — | Easy |
Now let me break down each category and tell you what I actually think.
HTTP clients: the foundation of every scraper
Every scraping project starts with an HTTP client – the part that actually downloads the raw HTML. You’d think they’re all the same. They’re not.
1. Requests – the default everyone uses (and maybe shouldn’t)
Requests is to Python scraping what jQuery was to JavaScript. Everybody starts with it, millions of tutorials teach it, and it works fine until it doesn’t.
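The original listing did not survive here, so this is a minimal sketch of what the typical tutorial script looks like, assuming the goal is to print a page title. The function name and the example URL are illustrative.

```python
import requests
from bs4 import BeautifulSoup


def fetch_title(url: str) -> str:
    """Download a page and return the text of its <title> tag."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    # soup.title is None when the page has no <title> element
    return soup.title.get_text(strip=True) if soup.title else ""


# Example call (performs a real network request):
# print(fetch_title("https://example.com"))
```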
That’s the tutorial version. Clean, simple, 4 lines.
Here’s the reality: no HTTP/2 support, no async mode, HTTP/1.1 headers that are trivially fingerprinted, and basic connection pooling. That’s Requests in 2026.
For grabbing a few pages? Fine. For anything beyond hobby scraping, you’ll outgrow it in a week.
PROS
- ✅ Simplest API in the Python ecosystem
- ✅ 128M+ weekly downloads — every problem is on Stack Overflow
- ✅ Mature, stable API (current releases are Python 3 only; 2.7 support ended with Requests 2.28)
CONS
- ❌ No HTTP/2 — easy to fingerprint
- ❌ No async support — one request at a time
- ❌ No TLS fingerprint control
My verdict: Use it for throwaway scripts and learning. For anything that touches production, read the next entry.
Requests in numbers: 128M+ weekly downloads · HTTP/1.1 only (easy to fingerprint) · 0 async support
2. HTTPX – what Requests should have been
HTTPX is the modern replacement. Same familiar API, but with async support, HTTP/2, and proper connection management.
When I switched our internal scraping tests from Requests to HTTPX, blocked responses dropped by roughly 15%. Just from the HTTP/2 upgrade. That surprised me.
Async mode is what makes it worth the switch. 100 pages synchronously with Requests? Minutes. HTTPX async? Seconds.
PROS
- ✅ HTTP/2 support — harder to fingerprint
- ✅ Async and sync APIs in one package
- ✅ Connection pooling, timeouts, retries done right
CONS
- ❌ Slightly fewer tutorials than Requests
- ❌ No TLS fingerprint impersonation (that's curl_cffi's job)
Bottom line: This is my default HTTP client for scraping in 2026. If you’re starting a new project, start here.
Before (Requests, sync): ~4 min for 100 pages, one at a time
After (HTTPX, async): ~8 sec for 100 pages, concurrent
3. curl_cffi – the anti-detection specialist
Here’s something most scraping articles won’t tell you: many sites don’t block you because of your headers or your IP. They block you because your TLS handshake looks nothing like a real browser.
curl_cffi fixes this. It’s built on cURL Impersonate and mimics the exact TLS fingerprint (JA3) of real browsers – Chrome, Firefox, Safari.
The server sees your request and thinks it’s Chrome 124. Not a Python script.
That one parameter – impersonate="chrome" – and suddenly sites that block Requests and HTTPX let curl_cffi through without a second look.
PROS
- ✅ TLS/JA3 fingerprint impersonation — biggest anti-detection upgrade you can make
- ✅ Requests-compatible API — easy migration
- ✅ Async support, HTTP/2, WebSockets
CONS
- ❌ Requires C binary (cURL) — slightly harder to install
- ❌ Smaller community, fewer resources
- ❌ Can't execute JavaScript
Where it fits: Keep this in your back pocket. When HTTPX starts getting blocked and you don’t want to spin up a full browser, curl_cffi is the answer.
Bottom line on HTTP clients:
HTTPX for new projects. Switch to curl_cffi when TLS fingerprinting gets you blocked. Requests only for throwaway scripts where learning the other two isn't worth it.
HTML parsers: getting data out of the mess
HTTP clients download the page. Parsers extract the data. Different tools, different jobs. Every serious scraping project uses both.
4. BeautifulSoup – the friendly parser everyone knows
BeautifulSoup (BS4) is the most taught HTML parser in Python. If you’ve followed any scraping tutorial, you’ve used it. And honestly? For simple pages, it’s hard to beat.
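A minimal sketch of the three calls most tutorials teach, run against a small inline HTML snippet. The markup and class names are made up for illustration.

```python
from bs4 import BeautifulSoup

HTML = """
<ul class="products">
  <li class="product"><h2>Widget</h2><span class="price">$9.99</span></li>
  <li class="product"><h2>Gadget</h2><span class="price">$24.50</span></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns the first matching tag
first_name = soup.find("h2").get_text()

# find_all() returns every matching tag
names = [h2.get_text() for h2 in soup.find_all("h2")]

# select() takes a CSS selector
prices = [tag.get_text() for tag in soup.select("li.product span.price")]

print(first_name, names, prices)
```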
The API reads like English: find, find_all, select. No need to learn XPath or think about DOM trees. That matters when you’re prototyping or teaching someone the basics.
But it’s slow. On a benchmark parsing 10,000 product pages, BS4 with lxml took 3.2 seconds. Parsel did the same job in 0.8 seconds.
For small projects, who cares. For production pipelines processing millions of pages, that 4x difference adds up fast.
PROS
- ✅ Most beginner-friendly parser in Python
- ✅ Handles malformed/broken HTML gracefully
- ✅ Pluggable backends (lxml, html5lib, html.parser)
CONS
- ❌ 3-4x slower than lxml/Parsel
- ❌ Doesn't fetch pages — always needs an HTTP client
- ❌ No XPath support (its own find API and CSS selectors only)
My verdict: The best teaching tool in scraping. For production, I reach for Parsel instead.
BeautifulSoup: 3.2s for 10,000 product pages
Parsel + lxml: 0.8s for the same 10,000 pages (4x faster)
5. Parsel + lxml – the speed demons
Parsel is what Scrapy uses under the hood. It wraps lxml (the fastest HTML parser in Python, written in C) with an API that supports both CSS selectors and XPath.
XPath is where Parsel pulls ahead. Want to find a <td> that contains the text “Price” and grab the value from the next column? One line in XPath. A three-step detour in BeautifulSoup.
PROS
- ✅ 3-4x faster than BeautifulSoup
- ✅ XPath + CSS selectors in one API
- ✅ Same parser Scrapy uses — proven at scale
CONS
- ❌ XPath has a learning curve
- ❌ Less forgiving with malformed HTML than BS4
- ❌ Fewer tutorials for beginners
My verdict: If you’re scraping anything serious, learn Parsel. The speed difference alone justifies it, and XPath is one of those skills you use for years.
Bottom line on parsers:
BeautifulSoup for learning and quick prototyping. Parsel + lxml for anything that touches production or processes more than a few hundred pages.
Browser automation: when you need a real browser
About 60% of modern websites load data via JavaScript after the initial page render. HTTP clients can’t see that content – they download the raw HTML before JS runs.
For those sites, you need a headless browser.
6. Selenium – the old guard
Selenium has been the default browser automation tool since 2004. Twenty-plus years. That’s both its strength and its problem.
That explicit wait boilerplate – WebDriverWait, EC.presence_of_element_located – you’ll write it hundreds of times. Playwright does this automatically.
That single difference made me switch. Life’s too short for wait boilerplate.
Selenium still matters for one reason: legacy. If your team has 50,000 lines of Selenium tests, you’re not rewriting them. Fair enough.
But for a new project in 2026? I can’t think of a reason to pick Selenium over Playwright.
PROS
- ✅ Most mature browser automation tool (20 years)
- ✅ Supports Chrome, Firefox, Safari, Edge
- ✅ Massive community — 31K+ GitHub stars
CONS
- ❌ Slower than Playwright
- ❌ No auto-wait — manual wait logic everywhere
- ❌ Flaky tests are a known industry pain point
Honest take: Legacy tool. If you’re starting fresh, skip to Playwright below.
Selenium vs Playwright: why it matters
| | Selenium | Playwright |
|---|---|---|
| Waiting | Manual wait logic | Auto-wait built in |
| In practice | WebDriverWait + expected_conditions on every element | Locators wait automatically, zero boilerplate |
| Effect | ⚠ Source of most "flaky test" complaints | ✓ Eliminates an entire class of timing bugs |
7. Playwright – the modern browser standard
Playwright is what I use for all browser-based scraping in 2026. Built by Microsoft, faster than Selenium, auto-waits for elements, and the API reads like pseudocode.
You write what you mean. It works.
See the difference? No WebDriverWait. No expected_conditions. The locator just waits until the element appears. An entire class of flaky-test bugs, gone.
It also supports network interception – block image loads for faster scraping, capture API requests the page makes behind the scenes, or modify requests before they’re sent. Really useful for SPAs that fetch data through GraphQL or internal APIs.
I use Playwright daily in our Amazon data extraction API testing and for building scraper prototypes behind our Google Maps scraping API.
It’s also what I recommend in our Node.js scraping guide – works great in both Python and Node.
PROS
- ✅ Auto-wait — no flaky wait logic
- ✅ Network interception for API capture
- ✅ Faster execution than Selenium
- ✅ Multiple browser contexts in one instance
CONS
- ❌ Heavy — downloads browser binaries (~200MB)
- ❌ Resource intensive in production
- ❌ No built-in anti-bot evasion (that's SeleniumBase)
My verdict: If you need a browser for scraping, this is the one. No contest.
Pro tip:
Before building a browser-based scraper for Google, Amazon, or Maps, check whether a data extraction API already returns the structured JSON you need. You'll skip the browser overhead, proxy costs, and anti-bot maintenance entirely.
Frameworks: when scripts aren’t enough
At some point, a scraping script turns into a scraping system. You need request queuing, retry logic, rate limiting, data pipelines, output formats, duplicate detection. Building all that yourself is a project in itself.
8. Scrapy – the enterprise crawler
Scrapy is the most powerful scraping framework in Python. Not close. It handles HTTP requests, URL queuing, depth control, rate limiting, data cleaning pipelines, and exports to JSON, CSV, XML, or databases.
Entire scraping infrastructure in one pip install.
That spider handles pagination, data extraction, and URL following in 15 lines. Scrapy runs it asynchronously, respects rate limits, retries failed requests. Export to JSON with zero additional code.
The tradeoff is the learning curve. Scrapy has its own way of doing things – spiders, items, pipelines, middlewares, settings. Takes a solid week to feel comfortable.
But once you’re past that week, you can crawl entire sites with hundreds of thousands of pages reliably.
If you need Amazon product data at scale, Scrapy is what I’d use for the DIY approach. I used a similar spider pattern when building our Amazon price tracker tutorial.
Or, you know, just call the API and skip the 500 lines of spider code.
PROS
- ✅ Complete scraping infrastructure in one package
- ✅ Async architecture — handles thousands of pages/minute
- ✅ Built-in data pipelines, exports, duplicate filtering
- ✅ Battle-tested at scale (used by companies processing billions of pages)
CONS
- ❌ Steep learning curve — it's a framework, not a library
- ❌ No JavaScript rendering (needs Scrapy-Playwright or Splash plugin)
- ❌ Overkill for small, one-off scraping jobs
My take: The right choice if you’re building a crawler that needs to run reliably for months. Overkill if you’re scraping one page.
If you can scrape it with HTTPX, don’t use Scrapy. If you need Scrapy, you’ll know. Your script outgrew its while loop three refactors ago.
Stealth and anti-bot evasion
Modern websites don’t just serve HTML. They fingerprint your browser, analyze mouse movements, deploy CAPTCHAs, and use services like Cloudflare, DataDome, and PerimeterX to block automated access.
These tools fight back.
9. SeleniumBase – Selenium with stealth mode
SeleniumBase takes Selenium and adds what it desperately needed: anti-bot evasion.
UC Mode (Undetected Chrome) patches Chrome to avoid common detection fingerprints – WebDriver flags, navigator properties, Chrome DevTools Protocol leaks.
That uc=True parameter enables stealth mode. uc_gui_handle_cf() automatically handles Cloudflare “checking your browser” challenges.
For sites behind basic anti-bot protection, SeleniumBase gets through more often than not.
The catch? It’s still a browser. Slow, resource-heavy, and at scale (hundreds of concurrent sessions), expensive to run.
For high-volume scraping of Google search results or Amazon product pages, a managed data API ends up being both cheaper and more reliable.
PROS
- ✅ UC Mode bypasses Cloudflare and similar protections
- ✅ Built on Selenium — familiar API
- ✅ Automatic browser/driver management
- ✅ CAPTCHA-handling capabilities
CONS
- ❌ Heavy resource usage — one browser per session
- ❌ Anti-bot evasion breaks with every Chrome update
- ❌ Not reliable against advanced protections (DataDome, PerimeterX)
My verdict: Best free option for bypassing Cloudflare. Don’t expect it to work against every anti-bot system – the arms race moves fast and SeleniumBase is always playing catch-up.
10. 2Captcha – solving CAPTCHAs programmatically
Not a scraping library per se, but a service you’ll inevitably need. 2Captcha uses real humans and AI to solve CAPTCHAs – reCAPTCHA v2/v3, hCaptcha, FunCaptcha, image captchas.
Pricing is about $2.99 per 1,000 solves for both normal CAPTCHAs and reCAPTCHA v2. Not free, but $3 for 1,000 solves is cheap when a CAPTCHA wall is the only thing between you and the data.
CapSolver is a solid alternative with similar pricing and faster solve times for some CAPTCHA types. Both work with Selenium, Playwright, and SeleniumBase.
What it is: A utility tool, not a scraping tool. But when you need it, nothing else will do.
If you’re hitting CAPTCHAs often enough to care about the cost, it’s worth asking whether a scraping data API would save you money compared to solving them yourself.
The real cost of anti-bot evasion:
Proxy rotation ($50-200/mo), CAPTCHA solving ($3/1K solves), and the hours debugging broken selectors. Add it up and a managed API at $9.99/month starts looking like a bargain. Do the math for your volume before committing to the DIY path.
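A rough way to do that math, as a sketch. The proxy and CAPTCHA prices are the figures quoted above; the solve volume, debugging hours, and hourly rate in the example are assumptions you should replace with your own.

```python
def diy_monthly_cost(
    proxy_usd: float,
    captcha_solves: int,
    debug_hours: float,
    hourly_rate_usd: float,
    captcha_usd_per_1k: float = 3.0,
) -> float:
    """Estimate the monthly cost of running your own anti-bot stack."""
    captcha_usd = captcha_solves / 1000 * captcha_usd_per_1k
    labor_usd = debug_hours * hourly_rate_usd
    return proxy_usd + captcha_usd + labor_usd


# Assumed workload: cheapest proxy tier, 10,000 solves,
# 6 hours of selector debugging valued at $40/hour
estimate = diy_monthly_cost(proxy_usd=50, captcha_solves=10_000, debug_hours=6, hourly_rate_usd=40)
print(f"DIY estimate: ${estimate:.2f}/month")
```

In this example the debugging time dominates the total, not the proxies or the CAPTCHA solves. When comparing against an API plan, check that the plan's request quota actually covers your volume.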
AI-powered scraping: the new wave
Most scraping listicles skip this category. I think it’s the most interesting one in 2026. These tools don’t just download and parse pages – they understand the content.
11. Crawl4AI – turning websites into LLM-ready data
Crawl4AI is an open-source crawler built for one specific job: turning web pages into clean, structured data that LLMs can consume.
Instead of writing CSS selectors for every field, you describe what you want in natural language and let the LLM extract it.
The result.markdown output is where it gets interesting. Crawl4AI strips navigation, ads, footers, boilerplate – gives you just the main content as clean markdown.
Feed that to Claude or GPT and you get structured data extraction without writing a single CSS selector.
I’ve been experimenting with it for competitive research – crawling competitor pricing pages and extracting structured comparisons automatically. Works well when the HTML layout varies between pages and hand-writing selectors would be a nightmare.
PROS
- ✅ Clean markdown output — perfect for LLM pipelines
- ✅ No CSS selectors needed for basic extraction
- ✅ Handles JavaScript rendering
- ✅ Open-source and actively maintained
CONS
- ❌ LLM extraction adds latency and cost (API calls)
- ❌ Less precise than hand-written selectors
- ❌ Relatively new — smaller community
My verdict: My favorite new tool on this list. Not a replacement for traditional scraping, but for unstructured data extraction and AI pipelines, it opens up jobs that were impractical before.
The scraping stack at a glance
| Category | Count | Tools |
|---|---|---|
| HTTP clients | 3 | Requests · HTTPX · curl_cffi |
| Parsers | 2 | BeautifulSoup · Parsel |
| Browsers | 2 | Selenium · Playwright |
| Framework | 1 | Scrapy |
| Anti-bot | 2 | SeleniumBase · 2Captcha |
| AI-powered | 1 | Crawl4AI |
Which Python scraping tool should you actually use?
Here’s my decision tree after going through all 11:
Scraping a static site for a quick project?
HTTPX + BeautifulSoup. Fast to write, fast to run, handles 80% of scraping jobs.
Need to scrape JavaScript-rendered pages?
Playwright. Auto-waits, modern API, network interception. There's no reason to pick Selenium for a new project.
Building a large-scale crawler?
Scrapy + Parsel. It's a framework, not a library — invest the learning time, it pays back tenfold.
Getting blocked by anti-bot systems?
Try curl_cffi first (TLS fingerprinting). If that's not enough, SeleniumBase UC Mode. If that still fails, the site has won — use an API.
Need data from Google, Amazon, or Google Maps?
Honestly? Don't scrape them yourself. These sites have the most aggressive anti-bot systems on the internet. I built FlyByAPIs specifically because maintaining scrapers for these targets was a full-time job. A single API call returns structured JSON — no proxies, no CAPTCHAs, no maintenance.
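The "getting blocked" branch of that tree is an escalation ladder, and it can be sketched as one. Everything here is an assumption for illustration: the block heuristics are crude, and each fetcher is a hypothetical callable you would wrap around HTTPX, curl_cffi, or a browser.

```python
from typing import Callable, Optional, Sequence

# A fetcher takes a URL and returns (status_code, body)
Fetcher = Callable[[str], tuple[int, str]]

# Crude, assumed signals that a response is a block page
BLOCK_STATUSES = {403, 429, 503}
BLOCK_PHRASES = ("captcha", "access denied", "unusual traffic")


def looks_blocked(status: int, body: str) -> bool:
    """Guess whether a response is an anti-bot block rather than real content."""
    lowered = body.lower()
    return status in BLOCK_STATUSES or any(phrase in lowered for phrase in BLOCK_PHRASES)


def fetch_with_escalation(
    url: str, fetchers: Sequence[tuple[str, Fetcher]]
) -> Optional[tuple[str, str]]:
    """Try fetchers from cheapest to heaviest; return (tool_name, body) or None."""
    for name, fetch in fetchers:
        try:
            status, body = fetch(url)
        except Exception:
            continue  # treat transport errors like a block and move on
        if not looks_blocked(status, body):
            return name, body
    return None  # every tool was blocked: time to consider an API
```

Ordering the fetchers from cheapest to most expensive means the browser only starts when the lighter clients have already failed.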
When to skip scraping entirely and use an API
Bear with me on this digression. I think it’s worth making.
I just spent a lot of words telling you which tools to use for scraping. But sometimes the right move is… not scraping at all.
Maintaining scrapers for heavily protected sites – Google, Amazon, LinkedIn, Google Maps – is exhausting. I had an Amazon scraper break three times in ten days because they kept rotating HTML class names. Each fix: 2-4 hours of debugging, plus re-running every failed job. Multiply that across months.
That’s exactly why I built FlyByAPIs. So nobody else has to maintain that infrastructure.
For these targets, a data API like FlyByAPIs replaces the whole scraping stack. HTTP client, parser, proxy rotation, anti-bot evasion, CAPTCHA solving, and the maintenance. One API call, structured JSON response, done.
The Google Search API returns SERPs, People Also Ask, autocomplete, and featured snippets.
The Amazon scraper API returns product data, prices, reviews, and search results with country-pinned IP routing for accurate local pricing.
We also run a Crunchbase data API for company enrichment, a jobs search API for recruitment pipelines, a Google Maps data API for location intelligence, and a translation API for multilingual scraping workflows.
All start free on RapidAPI. Scale to millions of requests.
That said, if your target is a simple site, a small e-commerce store, or a niche forum without anti-bot protection, scraping is absolutely the right call.
Don’t overpay for an API when httpx + beautifulsoup does the job in 10 lines. I mean that. The point is knowing when each approach makes sense, not picking a side.
The Python scraping ecosystem in 2026 is genuinely good. Most of these tools are mature and well-documented, and they won’t cost you a cent. The hard part isn’t picking a library – it’s knowing when to switch from one category to the next.
Start simple. HTTPX + BeautifulSoup for your first project. Playwright when you hit JavaScript walls. Scrapy when you need scale.
And when you’re spending more time fighting anti-bot systems than building the actual thing you set out to build, that’s your sign to look at a data API.
I update this list every few months as tools change. If I missed your favorite Python scraping library, or if you think I’m wrong about Selenium (I might be), let me know.
Oriol.
