Python Web Scraping: How to Scrape Any Website with Python

Learn Python web scraping from first request to production data. Covers requests, BeautifulSoup, headers, Selenium, and when an API saves you months of work.

pip install requests beautifulsoup4 — that’s where every Python web scraping tutorial starts. And for about 30 minutes, everything works perfectly.

You fetch a page, parse the HTML, pull out the data you need. This is easy, you think. Then you try a real website and get a 403 Forbidden. Or worse: the page loads fine but the data you need simply isn’t in the HTML because it’s loaded by JavaScript after the page renders.

5 methods covered · all examples in Python · copy-paste working code · progressive difficulty, from beginner up

Web scraping in Python is the practice of programmatically fetching web pages and extracting structured data from HTML — using libraries like requests for downloading and BeautifulSoup for parsing. For sites with heavy anti-bot protection, a dedicated data API can replace the scraping logic entirely with a simple call that returns clean JSON.

I’ve been building Python web scraping infrastructure for over three years. I’ve written scrapers that worked beautifully for a week and scrapers that broke before I finished my coffee. The difference between those two outcomes comes down to understanding which tool fits which problem.

TL;DR: This tutorial covers 5 Python web scraping methods — from requests + BeautifulSoup for static pages to Selenium for JavaScript-heavy sites. For production scraping against anti-bot targets like Amazon and Google, FlyByAPIs data APIs replace 30+ lines of Selenium code with a single GET request, starting free on RapidAPI.

This isn’t a tool comparison (that’s a [separate article coming soon]). This is the practical how-to — code you can run right now, with honest explanations of where each approach breaks down.

What is web scraping (and when do you need it)?

Web scraping is pulling data from websites programmatically. Instead of copying and pasting by hand, you write code that does it for you — fetching pages, parsing the HTML, and extracting the pieces you care about.

You need it when:

  • A website has data you want but no API to access it
  • You’re tracking prices, monitoring competitors, or collecting research data
  • You need to scrape web data from hundreds or thousands of pages — way more than any human could do manually

Python is the go-to language for web scraping because of its ecosystem. The best Python tools for web scraping — requests, BeautifulSoup, Selenium, and Scrapy — cover every level of complexity. And Python’s readable syntax means your scraping scripts stay maintainable even months later.

But here’s the thing I wish someone had told me earlier: the difficulty of scraping a website has almost nothing to do with the data you want. It depends entirely on how the site is built and how hard it fights back against automated access.

A static blog? Ten lines of code. Amazon product data? That’s a full engineering project.

Let me show you the progression.


Scraping static pages with requests and BeautifulSoup

This is level one. The simplest web scraping setup in Python — and honestly, it handles more websites than you’d expect.

First, install both libraries:

pip install requests beautifulsoup4

Here’s a working example that scrapes article titles from a page:

import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

titles = soup.select(".titleline > a")
for title in titles:
    print(title.text)

That’s it: six lines of actual logic once you skip the imports. requests.get() downloads the HTML, BeautifulSoup parses it into a navigable tree, and .select() finds elements using CSS selectors — the same selectors you’d use in your browser’s DevTools.

How to find the right selector

Open the page in Chrome, right-click the element you want, and click Inspect. You'll see the HTML structure. Look for class names, IDs, or tag patterns that uniquely identify your target data. Then use those in soup.select() or soup.find().

A few more methods you’ll use constantly:

# Find by ID
element = soup.find(id="main-content")

# Find by tag + class
items = soup.find_all("div", class_="product-card")

# Get text content
title = soup.find("h1").get_text(strip=True)

# Get an attribute (like href from a link)
link = soup.find("a")["href"]

Saving scraped data to CSV

Once you’ve extracted data, you’ll probably want to save it. Python’s built-in csv module works fine:

import csv
import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

with open("titles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["position", "title", "url"])

    for i, link in enumerate(soup.select(".titleline > a"), 1):
        writer.writerow([i, link.text, link.get("href", "")])

print("Saved to titles.csv")

This approach works great for blogs, news sites, documentation pages, forums, and any website where the content is in the HTML that the server sends. No JavaScript rendering needed — requests downloads the raw HTML, and everything you need is right there.

But not every website is that cooperative.


Why some websites block your scraper (and how to fix it with headers)

You write a scraper that works perfectly on one site. You point it at another and get a 403 Forbidden or a 406 Not Acceptable. What happened?

The answer is almost always headers.

When your browser visits a website, it sends along a bunch of metadata: what browser you’re using, what languages you accept, what site referred you. These are HTTP headers. When requests.get() sends a request with no headers, the server sees something like this:

User-Agent: python-requests/2.31.0

That’s a dead giveaway. The server knows you’re a script, not a browser. Many sites block requests like this automatically.
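You can confirm this yourself: requests exposes the default headers it sends when you don’t override anything.

import requests

# The default headers requests attaches to every request; note the User-Agent
print(requests.utils.default_headers())
# Prints something like:
# {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate',
#  'Accept': '*/*', 'Connection': 'keep-alive'}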

The fix is simple — add realistic headers:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

url = "https://example.com/products"
response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    # parse your data...
else:
    print(f"Blocked: {response.status_code}")

Pro tip

Copy the exact headers your browser sends. Open DevTools → Network tab → click any request → scroll to "Request Headers". Copy the User-Agent, Accept, and Accept-Language values. These are the headers that real browsers send — using them makes your requests look legitimate.

Being a good scraping citizen

Headers get you past the first gate. But if you then blast a server with 100 requests per second, you’ll get blocked for a different reason — rate limiting.

import time
import random

urls = ["https://example.com/page/1", "https://example.com/page/2", "..."]

for url in urls:
    response = requests.get(url, headers=headers)
    # process the response...
    time.sleep(random.uniform(1, 3))  # random delay between requests

A random delay between 1–3 seconds mimics human browsing behavior and keeps most servers happy. Also check the site’s robots.txt (add /robots.txt to any domain) to see what the site owner allows.
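If you want to automate that robots.txt check, the standard library can do it for you. A minimal sketch (the URL and user-agent string here are just placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # downloads and parses the file

# Check whether a given path is allowed for your user agent
if rp.can_fetch("MyScraperBot/1.0", "https://example.com/page/1"):
    print("Allowed by robots.txt")
else:
    print("Disallowed: skip this URL")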

These two things — proper headers and polite delays — solve 80% of blocking issues when you scrape websites with Python. But there’s a whole category of websites where headers alone aren’t enough.


Scraping dynamic websites with Selenium

Here’s the moment that trips up every beginner: you inspect a page in your browser, see all the data right there in the DOM, but when you download it with requests the data is missing.

The reason? The content is loaded by JavaScript after the initial HTML loads. Modern web frameworks like React, Angular, and Vue build the page dynamically in the browser. The server sends a mostly empty HTML shell, and JavaScript fills it in.

Key distinction

requests only downloads that empty shell. It doesn't execute JavaScript. For sites that render content client-side, you need a real browser — that's where Selenium comes in.
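You can see the gap directly. Fetching the JavaScript demo page used in the Selenium example below with plain requests finds nothing, because the quotes are rendered client-side:

import requests
from bs4 import BeautifulSoup

# This page injects its content with JavaScript after the initial HTML loads
html = requests.get("https://quotes.toscrape.com/js/").text
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select(".quote")))  # 0: the quotes aren't in the raw HTML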

Selenium controls an actual Chrome (or Firefox) browser programmatically — navigating pages, waiting for JavaScript to render, clicking buttons, filling forms. Your Python code drives the browser like a puppet.

Install Selenium:

pip install selenium

You’ll also need ChromeDriver installed and matching your Chrome version. The easiest way in 2026:

pip install webdriver-manager

Here’s a basic example that scrapes a JavaScript-rendered page:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)

try:
    driver.get("https://quotes.toscrape.com/js/")

    # Wait for JavaScript to render the content
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "quote"))
    )

    quotes = driver.find_elements(By.CLASS_NAME, "quote")
    for quote in quotes:
        text = quote.find_element(By.CLASS_NAME, "text").text
        author = quote.find_element(By.CLASS_NAME, "author").text
        print(f"{author}: {text[:80]}...")

finally:
    driver.quit()

Notice the --headless=new flag. That’s important — let’s talk about it.


Headless mode: running Selenium without a visible browser

Headless mode means the browser runs in the background with no visible window. Same rendering engine, same JavaScript execution, but no GUI. This is what you want for production scraping and server environments.

The --headless=new argument in Chrome enables the new headless mode (Chrome 112+), which behaves identically to a regular browser window — just invisible.

options = Options()
options.add_argument("--headless=new")        # no visible browser window
options.add_argument("--disable-gpu")         # recommended for headless
options.add_argument("--window-size=1920,1080")  # set a realistic viewport
options.add_argument("--no-sandbox")

Why does this matter? Two reasons:

  1. Speed. Headless mode is faster because it doesn’t need to actually draw pixels on screen.
  2. Servers. If you’re running a scraper on a cloud server or in a Docker container, there’s no display available. Headless is your only option.

Watch out

Some websites detect headless browsers by checking for certain JavaScript properties that differ between headless and regular Chrome. If a site works in your browser but fails in headless mode, that's likely why. The next section covers a solution.

When Selenium itself isn’t enough

For most websites, Selenium in headless mode works. But the big platforms — Amazon, Google, LinkedIn, major e-commerce sites — have anti-bot systems that specifically detect and block Selenium.

They detect it because default Selenium sets a navigator.webdriver flag to true in the browser, among other tells. Anti-bot services like Cloudflare, DataDome, and PerimeterX check for these flags automatically — and block your scraper before it extracts a single byte.
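You can check that flag yourself. A small sketch reusing the headless setup from the earlier example:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)

try:
    driver.get("https://example.com")
    # The property anti-bot scripts check first
    print(driver.execute_script("return navigator.webdriver"))  # True under default Selenium
finally:
    driver.quit()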


Undetected chromedriver: getting past anti-bot detection

For sites with serious bot detection, there’s undetected-chromedriver. It patches Selenium to remove the automation flags that anti-bot systems look for.

pip install undetected-chromedriver

import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--headless=new")

driver = uc.Chrome(options=options)

try:
    driver.get("https://example-protected-site.com")
    # now you can scrape pages that would block regular Selenium
    print(driver.page_source[:500])
finally:
    driver.quit()

The API is almost identical to regular Selenium — you just swap webdriver.Chrome for uc.Chrome. Under the hood, it patches the ChromeDriver binary to remove detectable traces.

I’m keeping this brief on purpose. undetected-chromedriver solves one specific problem (automation detection), but it doesn’t solve the bigger problems of production scraping: IP rotation, CAPTCHA solving, maintaining sessions across hundreds of requests, and dealing with sites that change their HTML structure every few weeks.

The scraping difficulty ladder

  1. Static HTML pages: requests + BeautifulSoup. Done in 10 lines. Works reliably forever.

  2. Sites that check headers: add a realistic User-Agent and request headers. Still just requests + BS4.

  3. JavaScript-rendered content: Selenium in headless mode. Slower, heavier, but handles dynamic pages.

  4. Anti-bot protected sites: undetected-chromedriver + proxies + delays + ongoing maintenance. Gets expensive fast.

  5. Major platforms (Amazon, Google, LinkedIn): full anti-bot evasion stack, rotating residential proxies, constant maintenance. Or... use an API.

I’ve spent three years on level 5. Building and maintaining scrapers for Google Search results, Amazon product data, Google Maps listings, and more. And at some point I had to ask myself: is it worth having every developer who needs this data go through the same pain?

The answer was no. So I built APIs instead.


The easier way: skip the scraping entirely with APIs

Here’s what I realized after years of maintaining web scrapers in Python: the data extraction is the easy part. The hard parts are everything around it.

Proxies. You need residential IP addresses that rotate on every request. That’s $5–15/GB depending on the provider.

Anti-bot evasion. Amazon changes their bot detection every few weeks. Google uses CAPTCHAs and rate limits. You’re in a constant arms race.

Country-specific accuracy. If you’re scraping Amazon Germany from a US IP address, you get US-localized data — wrong prices, wrong availability, wrong rankings. Every request needs to come from an IP inside the target country.

Monitoring. These platforms change their HTML structure without warning. Your scraper works on Tuesday, fails on Wednesday. Someone needs to notice and fix it fast.

That’s exactly why I built FlyByAPIs. It replaces 30+ lines of Selenium scraping code with a single API call that returns clean, structured JSON — handling proxies, anti-bot evasion, country-pinned IP routing, and ongoing maintenance behind the scenes.

What "country-pinned" means

Every request to our Amazon Scraper API is routed through an IP address inside that marketplace's country. Scraping Amazon Germany? The request goes through a German IP. Amazon Japan? Japanese IP. This ensures you get the exact same data a local shopper would see — correct prices, availability, and rankings.

Example: scraping Google search results with an API

Compare this to any of the Selenium examples above:

import requests

url = "https://google-serp-search-api.p.rapidapi.com/search"

params = {
    "q": "best restaurants in Barcelona",
    "num": "10",
    "gl": "es",
    "hl": "en"
}

headers = {
    "X-RapidAPI-Key": "YOUR_API_KEY",
    "X-RapidAPI-Host": "google-serp-search-api.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=params)
data = response.json()

for result in data["data"]["organic_results"]:
    print(f"{result['position']}. {result['title']}")
    print(f"   {result['link']}")
    print()

That’s it. Your Python web scraper goes from 30+ lines of Selenium code to a simple GET request that returns structured JSON with organic results, People Also Ask data, related searches — everything Google shows on the page.

The Google Search scraping API supports 250+ countries and languages, returns results in under 2 seconds, and starts with a free tier so you can test it before paying anything.

Example: getting Amazon product data

This is where the difference gets really obvious. Scraping Amazon with Selenium requires handling CAPTCHAs, rotating proxies, and fighting detection systems. With our Amazon web scraping API, it’s one request:

import requests

url = "https://real-time-amazon-data-the-most-complete.p.rapidapi.com/search"

params = {
    "query": "wireless headphones",
    "marketplace": "com",
    "sort_by": "RELEVANCE"
}

headers = {
    "X-RapidAPI-Key": "YOUR_API_KEY",
    "X-RapidAPI-Host": "real-time-amazon-data-the-most-complete.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=params)
data = response.json()

for product in data["data"]["products"]:
    print(f"{product['title']}")
    print(f"  Price: {product['price']}")
    print(f"  Rating: {product['rating']} ({product['reviews_count']} reviews)")
    print(f"  ASIN: {product['asin']}")
    print()

You get structured data for every product — title, price, ratings, reviews, availability, ASIN, images, and more — across 22 Amazon marketplaces. FlyByAPIs routes every Amazon request through an IP address inside the target marketplace’s country — change "marketplace": "com" to "marketplace": "de" and you get German Amazon data from a German IP, with no proxy configuration on your end.

Example: extracting Google Maps data

Need business listings, reviews, or place details? The Google Maps scraper API works the same way:

import requests

url = "https://google-maps-extractor2.p.rapidapi.com/locate_and_search"

params = {
    "query": "coffee shops in Austin TX",
    "limit": "10"
}

headers = {
    "X-RapidAPI-Key": "YOUR_API_KEY",
    "X-RapidAPI-Host": "google-maps-extractor2.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=params)
data = response.json()

for place in data["data"]:
    print(f"{place['name']}{place['rating']}★ ({place['reviews_count']} reviews)")
    print(f"  {place['address']}")
    print()

What we do behind the scenes

Since you’ve just read through the scraping difficulty ladder, you’ll appreciate what these APIs abstract away:

Problem | DIY scraping | FlyByAPIs
IP blocking | Buy & rotate residential proxies ($5–15/GB) | Handled — included in every plan
Anti-bot detection | undetected-chromedriver + constant patching | Handled — we maintain the evasion stack
Country accuracy | Buy country-specific proxies per marketplace | Country-pinned IPs — automatic per request
HTML changes | Scraper breaks, you fix it manually | Data-drift monitoring — fixes ship in hours
Data format | Parse messy HTML into structured data yourself | Clean JSON — ready for your application
Monthly cost at scale | $200–500/mo in proxies + dev time | From $9.99/mo — free tier available

We monitor every endpoint for data drift — if Amazon changes a CSS class name or Google modifies their SERP layout, we catch it and ship a fix within hours. You never have to update your code.

Available APIs

We have six APIs that cover the most common web scraping targets:

  • Google Search API — organic results, People Also Ask, related searches, autocomplete. 250+ country/language combinations.
  • Amazon Scraper API — product search, details, offers, reviews, deals, best sellers. 22 marketplaces with country-pinned IPs.
  • Google Maps Extractor — business search, place details, reviews, photos. Great for lead generation and local SEO.
  • Crunchbase Scraper API — company data, funding rounds, investors, acquisitions. 37 endpoints for startup and business intelligence.
  • Translator API — multi-format translation powered by AI. Documents, text, HTML — not just plain strings.
  • Jobs Search API — job listings aggregated from multiple sources. Useful for job boards, market research, and salary analysis.

All six are available on RapidAPI with free tiers. You can test any of them in under a minute — no credit card required.

Try the Google Search API free →

Free tier — no credit card required


When to scrape vs. when to use an API

Not every scraping job needs an API. And not every scraping job should be DIY. Here’s how I think about it:

Scrape it yourself when:

  • The target is a small, simple, static website
  • You need data from a niche site that no API covers
  • It’s a one-time job, not an ongoing data pipeline
  • You’re learning and want to understand how scraping works (which is exactly why this tutorial exists)

Use an API when:

  • The target has anti-bot protection (Amazon, Google, LinkedIn, etc.)
  • You need data from multiple countries or marketplaces
  • You need reliable, ongoing data for a production application
  • The cost of maintaining a scraper exceeds the cost of an API subscription
  • You’d rather spend your time building your actual product instead of fighting bot detection

I built both sides of this equation. I wrote web scrapers in Python that lasted years with zero maintenance. I also wrote scrapers that needed daily babysitting and cost more in developer time than any API subscription ever would. The trick is being honest about which situation you’re in.

Quick reference

If you're building an Amazon price tracker or a deals alert bot, an API will save you weeks of proxy and anti-bot work. We have working tutorials for both — with full code you can deploy in an afternoon.


Putting it all together: choosing your approach

You’ve seen the full spectrum now. Here’s the decision tree I’d use:

  1. Can you get the data with requests? Try it first. Add headers if you get blocked. If the data is in the HTML response, you’re done — BeautifulSoup handles the rest.

  2. Is the content loaded by JavaScript? Switch to Selenium with headless Chrome. Wait for the content to render, then extract it.

  3. Does the site actively block Selenium? Try undetected-chromedriver. It patches the automation flags that bot detection systems look for.

  4. Is it a major platform with serious anti-bot systems? Consider whether building and maintaining a full scraping stack is worth your time. For Google, Amazon, Crunchbase, and Google Maps data, I’d point you toward our APIs — not because I built them (well, partly because I built them), but because I’ve been on both sides and I know what the maintenance looks like.

The best Python web scraper is the one that matches the actual difficulty of what you’re scraping. Don’t bring Selenium to a static HTML fight, and don’t try to out-engineer Amazon’s anti-bot team with requests.


Every method in this tutorial works. I still use requests + BeautifulSoup for quick one-off scrapes. I still fire up Selenium when I need to interact with a dynamic page. But for anything that needs to run reliably in production — especially against sites that actively fight scrapers — I use the web scraping APIs I built specifically because I got tired of the alternative.

Start simple, move up when you hit walls, and don’t be afraid to outsource the hard parts to someone who’s already solved them.

Try FlyByAPIs free on RapidAPI →

Free tier on all six APIs — no credit card required

Oriol.

FAQ

Frequently Asked Questions

Q Is web scraping with Python legal?

Scraping publicly available data is generally legal — the 2022 hiQ v. LinkedIn ruling confirmed that scraping public pages doesn't violate the CFAA. However, always check a site's robots.txt and Terms of Service before scraping. For production workloads, a dedicated data API like FlyByAPIs eliminates legal gray areas entirely because you're accessing data through an authorized interface.

Q What is the best Python library for web scraping?

For static HTML pages, the requests + BeautifulSoup combination is the simplest and fastest option. For JavaScript-heavy sites, Selenium is the standard choice. For large-scale crawling with concurrency, Scrapy is the most powerful framework. For anti-bot-protected sites like Amazon or Google, a data API like FlyByAPIs skips scraping entirely and returns structured JSON.

Q Can I scrape any website with Python?

Technically yes, but the difficulty varies enormously. Simple static sites take 10 lines of code. Sites with JavaScript rendering need Selenium or Playwright. Sites with aggressive anti-bot systems like Amazon, Google, or LinkedIn require rotating proxies, browser fingerprint management, and constant maintenance. For these targets, FlyByAPIs data APIs handle the proxies, anti-bot evasion, and country-pinned IP routing for you.

Q Why does my Python scraper get blocked?

Most sites block scrapers that send requests without proper HTTP headers — especially a missing or default User-Agent string. Other common causes: too many requests too fast (no delays between requests), not rotating IP addresses, and using detectable browser automation. Adding realistic headers and delays fixes most blocking issues on simple sites.

Q How do I scrape a JavaScript-rendered website with Python?

Use Selenium with a headless Chrome browser. Install selenium and chromedriver, create a headless browser instance, navigate to the page, wait for JavaScript to render the content, then extract data from the fully rendered DOM. For sites with anti-bot protection, undetected-chromedriver patches Selenium to avoid detection flags.

Q What is the difference between requests and Selenium for web scraping?

The requests library downloads raw HTML — fast and lightweight, but it can't execute JavaScript. Selenium controls a real browser that renders JavaScript, handles dynamic content, and can click buttons or fill forms. Use requests for static sites, Selenium for anything that loads content dynamically. An API like FlyByAPIs handles both scenarios and returns clean JSON.

Q How much does it cost to run a Python web scraper in production?

DIY scraping costs add up fast: residential proxies run $5-15/GB, you need rotating IPs to avoid blocks, plus developer time for maintenance when sites change their HTML. A typical scraping operation costs $200-500/month in infrastructure. FlyByAPIs data APIs start free and cover most use cases for $9.99-49.99/month with no proxy or maintenance costs.

Q How do I export scraped data to CSV in Python?

Use Python's built-in csv module or the pandas library. With csv: open a file, create a csv.writer, write headers with writerow(), then loop through your scraped data writing each row. With pandas: put your data in a DataFrame and call df.to_csv('output.csv', index=False). Pandas is easier for large datasets and handles encoding issues automatically.
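For the pandas route, a minimal sketch (assuming your scraped rows are already a list of dicts):

import pandas as pd

# Each dict becomes one row; keys become the column headers
rows = [
    {"position": 1, "title": "Example story", "url": "https://example.com"},
    {"position": 2, "title": "Another story", "url": "https://example.org"},
]

pd.DataFrame(rows).to_csv("output.csv", index=False)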
Oriol Marti
Founder & CEO

Computer engineer and entrepreneur based in Andorra. Founder and CEO of FlyByAPIs, building reliable web data APIs for developers worldwide.

Free tier available

Ready to stop maintaining scrapers?

Production-ready APIs for web data extraction. Whatever you're building, up and running in minutes.

Start for free on RapidAPI