pip install requests beautifulsoup4 — that’s where every Python web scraping tutorial starts. And for about 30 minutes, everything works perfectly.
You fetch a page, parse the HTML, pull out the data you need. This is easy, you think. Then you try a real website and get a 403 Forbidden. Or worse: the page loads fine but the data you need simply isn’t in the HTML because it’s loaded by JavaScript after the page renders.
At a glance:
- 5 methods covered
- All examples in Python
- Copy-paste working code
- Progressive difficulty, from beginner up
Web scraping in Python is the practice of programmatically fetching web pages and extracting structured data from HTML — using libraries like requests for downloading and BeautifulSoup for parsing. For sites with heavy anti-bot protection, a dedicated data API can replace the scraping logic entirely with a simple call that returns clean JSON.
I’ve been building Python web scraping infrastructure for over three years. I’ve written scrapers that worked beautifully for a week and scrapers that broke before I finished my coffee. The difference between those two outcomes comes down to understanding which tool fits which problem.
TL;DR: This tutorial covers 5 Python web scraping methods — from requests + BeautifulSoup for static pages to Selenium for JavaScript-heavy sites. For production scraping against anti-bot targets like Amazon and Google, FlyByAPIs data APIs replace 30+ lines of Selenium code with a single GET request, starting free on RapidAPI.
This isn’t a tool comparison (that’s a [separate article coming soon]). This is the practical how-to — code you can run right now, with honest explanations of where each approach breaks down.
What is web scraping (and when do you need it)?
Web scraping is pulling data from websites programmatically. Instead of copying and pasting by hand, you write code that does it for you — fetching pages, parsing the HTML, and extracting the pieces you care about.
You need it when:
- A website has data you want but no API to access it
- You’re tracking prices, monitoring competitors, or collecting research data
- You need to scrape web data from hundreds or thousands of pages — way more than any human could do manually
Python is the go-to language for web scraping because of its ecosystem. The best Python tools for web scraping — requests, BeautifulSoup, Selenium, and Scrapy — cover every level of complexity. And Python’s readable syntax means your scraping scripts stay maintainable even months later.
But here’s the thing I wish someone had told me earlier: the difficulty of scraping a website has almost nothing to do with the data you want. It depends entirely on how the site is built and how hard it fights back against automated access.
A static blog? Ten lines of code. Amazon product data? That’s a full engineering project.
Let me show you the progression.
Scraping static pages with requests and BeautifulSoup
This is level one. The simplest web scraping setup in Python — and honestly, it handles more websites than you’d expect.
First, install both libraries:
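In your terminal:

```shell
pip install requests beautifulsoup4
```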
Here’s a working example that scrapes article titles from a page:
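A minimal sketch, using the Hacker News front page as a conveniently static target (the selector matches its markup at the time of writing and may change):

```python
import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# Each story title sits inside <span class="titleline"><a>...</a></span>
for link in soup.select("span.titleline > a"):
    print(link.get_text())
```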
That’s it. Six lines of actual code. requests.get() downloads the HTML, BeautifulSoup parses it into a navigable tree, and .select() finds elements using CSS selectors — the same selectors you’d use in your browser’s DevTools.
How to find the right selector
Open the page in Chrome, right-click the element you want, and click Inspect. You'll see the HTML structure. Look for class names, IDs, or tag patterns that uniquely identify your target data. Then use those in soup.select() or soup.find().
A few more methods you’ll use constantly:
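For example, given a small snippet of HTML:

```python
from bs4 import BeautifulSoup

html = '<div id="main"><a class="nav" href="/home">Home</a><a class="nav" href="/about">About</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                    # first matching tag
all_links = soup.find_all("a", class_="nav")   # every matching tag (a list)
main_div = soup.find("div", id="main")         # match by attribute

print(first_link.get_text())   # text inside the tag: Home
print(first_link["href"])      # attribute access: /home
print(len(all_links))          # 2
```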
Saving scraped data to CSV
Once you’ve extracted data, you’ll probably want to save it. Python’s built-in csv module works fine:
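A small sketch with csv.DictWriter (the rows are placeholders for whatever your scraper extracted):

```python
import csv

articles = [
    {"title": "First post", "url": "https://example.com/first"},
    {"title": "Second post", "url": "https://example.com/second"},
]

# newline="" prevents blank rows on Windows; utf-8 keeps non-ASCII titles intact
with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(articles)
```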
This approach works great for blogs, news sites, documentation pages, forums, and any website where the content is in the HTML that the server sends. No JavaScript rendering needed — requests downloads the raw HTML, and everything you need is right there.
But not every website is that cooperative.
Why some websites block your scraper (and how to fix it with headers)
You write a scraper, and it works perfectly on one site. You point it at another site and get a 403 Forbidden or a 406 Not Acceptable. What happened?
The answer is almost always headers.
When your browser visits a website, it sends along a bunch of metadata: what browser you’re using, what languages you accept, what site referred you. These are HTTP headers. When requests.get() sends a request with no headers, the server sees something like this:
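You can print those defaults yourself:

```python
import requests

# The headers requests sends when you provide none (exact values vary by version)
print(requests.utils.default_headers())
# e.g. {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate',
#       'Accept': '*/*', 'Connection': 'keep-alive'}
```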
That’s a dead giveaway. The server knows you’re a script, not a browser. Many sites block requests like this automatically.
The fix is simple — add realistic headers:
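For example (these values are typical for Chrome; copy your own as the tip below describes):

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://example.com", headers=headers)
print(response.status_code)
```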
Pro tip
Copy the exact headers your browser sends. Open DevTools → Network tab → click any request → scroll to "Request Headers". Copy the User-Agent, Accept, and Accept-Language values. These are the headers that real browsers send — using them makes your requests look legitimate.
Being a good scraping citizen
Headers get you past the first gate. But if you then blast a server with 100 requests per second, you’ll get blocked for a different reason — rate limiting.
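A minimal pattern (the URL list is a placeholder):

```python
import random
import time

urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    # ... fetch and parse the page here ...
    delay = random.uniform(1, 3)  # random pause between 1 and 3 seconds
    time.sleep(delay)
```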
A random delay between 1–3 seconds mimics human browsing behavior and keeps most servers happy. Also check the site’s robots.txt (add /robots.txt to any domain) to see what the site owner allows.
These two things — proper headers and polite delays — solve 80% of blocking issues when you scrape websites with Python. But there’s a whole category of websites where headers alone aren’t enough.
Scraping dynamic websites with Selenium
Here’s the moment that trips up every beginner: you inspect a page in your browser, see all the data right there in the DOM, but when you download it with requests the data is missing.
The reason? The content is loaded by JavaScript after the initial HTML loads. Modern web frameworks like React, Angular, and Vue build the page dynamically in the browser. The server sends a mostly empty HTML shell, and JavaScript fills it in.
Key distinction
requests only downloads that empty shell. It doesn't execute JavaScript. For sites that render content client-side, you need a real browser — that's where Selenium comes in.
Selenium controls an actual Chrome (or Firefox) browser programmatically — navigating pages, waiting for JavaScript to render, clicking buttons, filling forms. Your Python code drives the browser like a puppet.
Install Selenium:
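From your terminal:

```shell
pip install selenium
```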
You’ll also need ChromeDriver installed and matching your Chrome version. The easiest way in 2026:
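My best reading of "easiest way" here is Selenium Manager, bundled with Selenium 4.6 and later: it downloads a ChromeDriver matching your installed Chrome automatically the first time you start a driver, so upgrading Selenium is the only step (this is my assumption; use whatever driver setup you prefer):

```shell
# Selenium 4.6+ bundles Selenium Manager, which fetches a matching
# ChromeDriver automatically the first time you create a driver.
pip install --upgrade selenium
```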
Here’s a basic example that scrapes a JavaScript-rendered page:
Notice the --headless=new flag. That’s important — let’s talk about it.
Headless mode: running Selenium without a visible browser
Headless mode means the browser runs in the background with no visible window. Same rendering engine, same JavaScript execution, but no GUI. This is what you want for production scraping and server environments.
The --headless=new argument in Chrome enables the new headless mode (Chrome 112+), which behaves identically to a regular browser window — just invisible.
Why does this matter? Two reasons:
- Speed. Headless mode is faster because it doesn’t need to actually draw pixels on screen.
- Servers. If you’re running a scraper on a cloud server or in a Docker container, there’s no display available. Headless is your only option.
Watch out
Some websites detect headless browsers by checking for certain JavaScript properties that differ between headless and regular Chrome. If a site works in your browser but fails in headless mode, that's likely why. The next section covers a solution.
When Selenium itself isn’t enough
For most websites, Selenium in headless mode works. But the big platforms — Amazon, Google, LinkedIn, major e-commerce sites — have anti-bot systems that specifically detect and block Selenium.
They detect it because default Selenium sets a navigator.webdriver flag to true in the browser, among other tells. Anti-bot services like Cloudflare, DataDome, and PerimeterX check for these flags automatically — and block your scraper before it extracts a single byte.
Undetected chromedriver: getting past anti-bot detection
For sites with serious bot detection, there’s undetected-chromedriver. It patches Selenium to remove the automation flags that anti-bot systems look for.
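Install it:

```shell
pip install undetected-chromedriver
```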
The API is almost identical to regular Selenium — you just swap webdriver.Chrome for uc.Chrome. Under the hood, it patches the ChromeDriver binary to remove detectable traces.
I’m keeping this brief on purpose. undetected-chromedriver solves one specific problem (automation detection), but it doesn’t solve the bigger problems of production scraping: IP rotation, CAPTCHA solving, maintaining sessions across hundreds of requests, and dealing with sites that change their HTML structure every few weeks.
The scraping difficulty ladder
Level 1: Static HTML pages
requests + BeautifulSoup. Done in 10 lines. Works reliably forever.
Level 2: Sites that check headers
Add a realistic User-Agent and request headers. Still just requests + BS4.
Level 3: JavaScript-rendered content
Selenium in headless mode. Slower, heavier, but handles dynamic pages.
Level 4: Anti-bot protected sites
Undetected chromedriver + proxies + delays + ongoing maintenance. Gets expensive fast.
Level 5: Major platforms (Amazon, Google, LinkedIn)
Full anti-bot evasion stack, rotating residential proxies, constant maintenance. Or... use an API.
I’ve spent three years on level 5, building and maintaining scrapers for Google Search results, Amazon product data, Google Maps listings, and more. And at some point I had to ask myself: is it worth having every developer who needs this data go through the same pain?
The answer was no. So I built APIs instead.
The easier way: skip the scraping entirely with APIs
Here’s what I realized after years of maintaining web scrapers in Python: the data extraction is the easy part. The hard parts are everything around it.
Proxies. You need residential IP addresses that rotate on every request. That’s $5–15/GB depending on the provider.
Anti-bot evasion. Amazon changes their bot detection every few weeks. Google uses CAPTCHAs and rate limits. You’re in a constant arms race.
Country-specific accuracy. If you’re scraping Amazon Germany from a US IP address, you get US-localized data — wrong prices, wrong availability, wrong rankings. Every request needs to come from an IP inside the target country.
Monitoring. These platforms change their HTML structure without warning. Your scraper works on Tuesday, fails on Wednesday. Someone needs to notice and fix it fast.
That’s exactly why I built FlyByAPIs. It replaces 30+ lines of Selenium scraping code with a single API call that returns clean, structured JSON — handling proxies, anti-bot evasion, country-pinned IP routing, and ongoing maintenance behind the scenes.
What "country-pinned" means
Every request to our Amazon Scraper API is routed through an IP address inside that marketplace's country. Scraping Amazon Germany? The request goes through a German IP. Amazon Japan? Japanese IP. This ensures you get the exact same data a local shopper would see — correct prices, availability, and rankings.
Example: scraping Google search results with an API
Compare this to any of the Selenium examples above:
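The shape of that call, using RapidAPI's standard key and host headers (the host, path, and parameter names below are illustrative placeholders, not the API's real ones; check the RapidAPI listing for those):

```python
import requests

# Placeholder host/path/params -- the real values are on the API's RapidAPI page
url = "https://google-search.example-rapidapi-host.com/search"
headers = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
    "X-RapidAPI-Host": "google-search.example-rapidapi-host.com",
}
params = {"q": "best python web scraping tools", "country": "us", "lang": "en"}

response = requests.get(url, headers=headers, params=params)
data = response.json()  # organic results, People Also Ask, related searches, ...
```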
That’s it. Your Python web scraper goes from 30+ lines of Selenium code to a simple GET request that returns structured JSON with organic results, People Also Ask data, related searches — everything Google shows on the page.
The Google Search scraping API supports 250+ countries and languages, returns results in under 2 seconds, and starts with a free tier so you can test it before paying anything.
Example: getting Amazon product data
This is where the difference gets really obvious. Scraping Amazon with Selenium requires handling CAPTCHAs, rotating proxies, and fighting detection systems. With our Amazon web scraping API, it’s one request:
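A sketch in the same shape (again, the host, path, and most parameter names are placeholders; marketplace is the parameter that picks the country-pinned Amazon site):

```python
import requests

# Placeholder host/path -- the real values are on the API's RapidAPI page
url = "https://amazon-scraper.example-rapidapi-host.com/product"
headers = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
    "X-RapidAPI-Host": "amazon-scraper.example-rapidapi-host.com",
}
params = {"asin": "B0EXAMPLE", "marketplace": "com"}  # "de" -> German Amazon via a German IP

response = requests.get(url, headers=headers, params=params)
product = response.json()  # title, price, ratings, reviews, availability, images, ...
```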
You get structured data for every product — title, price, ratings, reviews, availability, ASIN, images, and more — across 22 Amazon marketplaces. FlyByAPIs routes every Amazon request through an IP address inside the target marketplace’s country — change "marketplace": "com" to "marketplace": "de" and you get German Amazon data from a German IP, with no proxy configuration on your end.
Example: extracting Google Maps data
Need business listings, reviews, or place details? The Google Maps scraper API works the same way:
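Same pattern again (placeholder host, path, and parameter names):

```python
import requests

# Placeholder host/path/params -- the real values are on the API's RapidAPI page
url = "https://google-maps.example-rapidapi-host.com/search"
headers = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
    "X-RapidAPI-Host": "google-maps.example-rapidapi-host.com",
}
params = {"query": "coffee shops in Berlin"}

response = requests.get(url, headers=headers, params=params)
places = response.json()  # business listings, ratings, addresses, place details
```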
What we do behind the scenes
Since you’ve just read through the scraping difficulty ladder, you’ll appreciate what these APIs abstract away:
| Problem | DIY scraping | FlyByAPIs |
|---|---|---|
| IP blocking | Buy & rotate residential proxies ($5–15/GB) | Handled — included in every plan |
| Anti-bot detection | undetected-chromedriver + constant patching | Handled — we maintain the evasion stack |
| Country accuracy | Buy country-specific proxies per marketplace | Country-pinned IPs — automatic per request |
| HTML changes | Scraper breaks, you fix it manually | Data-drift monitoring — fixes ship in hours |
| Data format | Parse messy HTML into structured data yourself | Clean JSON — ready for your application |
| Monthly cost at scale | $200–500/mo in proxies + dev time | From $9.99/mo — free tier available |
We monitor every endpoint for data drift — if Amazon changes a CSS class name or Google modifies their SERP layout, we catch it and ship a fix within hours. You never have to update your code.
Available APIs
We have six APIs that cover the most common web scraping targets:
- Google Search API — organic results, People Also Ask, related searches, autocomplete. 250+ country/language combinations.
- Amazon Scraper API — product search, details, offers, reviews, deals, best sellers. 22 marketplaces with country-pinned IPs.
- Google Maps Extractor — business search, place details, reviews, photos. Great for lead generation and local SEO.
- Crunchbase Scraper API — company data, funding rounds, investors, acquisitions. 37 endpoints for startup and business intelligence.
- Translator API — multi-format translation powered by AI. Documents, text, HTML — not just plain strings.
- Jobs Search API — job listings aggregated from multiple sources. Useful for job boards, market research, and salary analysis.
All six are available on RapidAPI with free tiers. You can test any of them in under a minute — no credit card required.
When to scrape vs. when to use an API
Not every scraping job needs an API. And not every scraping job should be DIY. Here’s how I think about it:
Scrape it yourself when:
- The target is a small, simple, static website
- You need data from a niche site that no API covers
- It’s a one-time job, not an ongoing data pipeline
- You’re learning and want to understand how scraping works (which is exactly why this tutorial exists)
Use an API when:
- The target has anti-bot protection (Amazon, Google, LinkedIn, etc.)
- You need data from multiple countries or marketplaces
- You need reliable, ongoing data for a production application
- The cost of maintaining a scraper exceeds the cost of an API subscription
- You’d rather spend your time building your actual product instead of fighting bot detection
I built both sides of this equation. I wrote web scrapers in Python that lasted years with zero maintenance. I also wrote scrapers that needed daily babysitting and cost more in developer time than any API subscription ever would. The trick is being honest about which situation you’re in.
Quick reference
If you're building an Amazon price tracker or a deals alert bot, an API will save you weeks of proxy and anti-bot work. We have working tutorials for both — with full code you can deploy in an afternoon.
Putting it all together: choosing your approach
You’ve seen the full spectrum now. Here’s the decision tree I’d use:
1. Can you get the data with requests? Try it first. Add headers if you get blocked. If the data is in the HTML response, you’re done — BeautifulSoup handles the rest.
2. Is the content loaded by JavaScript? Switch to Selenium with headless Chrome. Wait for the content to render, then extract it.
3. Does the site actively block Selenium? Try undetected-chromedriver. It patches the automation flags that bot detection systems look for.
4. Is it a major platform with serious anti-bot systems? Consider whether building and maintaining a full scraping stack is worth your time. For Google, Amazon, Crunchbase, and Google Maps data, I’d point you toward our APIs — not because I built them (well, partly because I built them), but because I’ve been on both sides and I know what the maintenance looks like.
The best Python web scraper is the one that matches the actual difficulty of what you’re scraping. Don’t bring Selenium to a static HTML fight, and don’t try to out-engineer Amazon’s anti-bot team with requests.
Every method in this tutorial works. I still use requests + BeautifulSoup for quick one-off scrapes. I still fire up Selenium when I need to interact with a dynamic page. But for anything that needs to run reliably in production — especially against sites that actively fight scrapers — I use the web scraping APIs I built, specifically because I got tired of the alternative.
Start simple, move up when you hit walls, and don’t be afraid to outsource the hard parts to someone who’s already solved them.
Free tier on all six APIs — no credit card required
Oriol.
