I spent a weekend building a Node.js scraper to pull product prices from five different e-commerce sites. Beautiful code. Clean selectors. Retry logic. The whole thing.
It worked perfectly for 11 days.
Then two sites changed their HTML structure, one started returning CAPTCHAs, and another rate-limited my IP. I spent more time fixing the scraper than I ever spent building it.
At a glance: two approaches covered (Cheerio for static sites, Puppeteer for dynamic sites), and roughly 15 minutes to your first scrape.
That’s the thing nobody tells you about web scraping: writing the scraper is the easy part. Keeping it alive is where the real work begins.
Node.js web scraping is the process of programmatically extracting data from websites using JavaScript libraries like Cheerio (for static HTML) and Puppeteer (for JavaScript-rendered pages). It’s one of the most common data collection techniques in 2026 — though production-grade scraping increasingly relies on managed APIs to handle the hard parts.
But you need to learn it anyway. Understanding how scraping works makes you a better developer, and sometimes a quick scraper is exactly the right tool.
So here’s the honest tutorial: how to scrape any website with Node.js, with real code that actually works, what happens when it breaks, and when you should reach for an API instead.
TL;DR: Cheerio scrapes static pages in ~50ms with 30MB of memory. Puppeteer handles JavaScript-rendered sites but needs ~300MB RAM and 2-5 seconds per page. For production workloads against sites with anti-bot protection (Google, Amazon, Maps), a data API like FlyByAPIs replaces 200+ lines of scraping code with a single HTTP request — starting at $0/month with 200 free requests.
What you need before we start
You need Node.js 18+ installed. That’s it. Check with:
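```bash
node --version
```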
If you see v18 or higher, you’re good. If not, grab the latest LTS from nodejs.org.
Now create a project:
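```bash
mkdir web-scraper && cd web-scraper   # any project name works
npm init -y
```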
We’ll install specific libraries as we go. First up: Cheerio for static websites, then Puppeteer for the tricky dynamic ones.
Static vs. dynamic — the one decision that matters
Right-click any webpage → View Page Source. If you can see the data you want in the raw HTML, it's static → use Cheerio. If the source is mostly empty `<div id="root"></div>` tags and JavaScript bundles, it's dynamic → use Puppeteer.
Scraping a static website with Cheerio
Cheerio is the jQuery of the server side. It parses HTML and lets you navigate the DOM with CSS selectors — fast, lightweight, and no browser needed.
Step 1: Install Cheerio
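```bash
npm install cheerio
```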
Step 2: Fetch and parse a page
Let’s scrape the Hacker News front page — it’s server-rendered HTML, perfect for Cheerio.
Create a file called scraper.js:
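A minimal sketch (the `.titleline > a` selector matches Hacker News's markup at the time of writing; adjust it if the site changes):

```js
const cheerio = require('cheerio');

async function scrape() {
  // Node 18+ ships fetch natively, so no request library needed
  const response = await fetch('https://news.ycombinator.com/');
  const html = await response.text();

  // Load the HTML and query it with CSS selectors, jQuery-style
  const $ = cheerio.load(html);
  const stories = [];

  $('.titleline > a').each((_, el) => {
    stories.push({
      title: $(el).text(),
      url: $(el).attr('href'),
    });
  });

  console.log(stories);
}

scrape();
```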
Step 3: Run it
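```bash
node scraper.js
```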
You should see something like:
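(Illustrative output; the actual titles change with every front page.)

```
[
  { title: 'An example front-page story', url: 'https://example.com/story' },
  { title: 'Show HN: Another example story', url: 'https://example.com/show' },
  ...
]
```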
That’s your first Node.js web scraper in 15 lines of actual code.
For production use cases — like pulling search rankings daily — you’ll eventually want a SERP data extraction API instead. But learning to scrape websites by hand first is how you understand what those APIs abstract away.
Why Cheerio is fast
Cheerio doesn't launch a browser. It just parses HTML strings — no CSS rendering, no JavaScript execution, no images loaded. That makes it 10-20x faster than headless browser approaches and uses a fraction of the memory. For static pages, there's no reason to reach for anything heavier.
Step 4: Export to CSV
Scraped data sitting in your terminal isn’t useful. Let’s save it:
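A minimal version using only Node's built-in fs module (a library like csv-stringify is sturdier for production exports):

```js
const fs = require('fs');

function saveToCsv(stories, filename = 'stories.csv') {
  // Quote every field and escape embedded quotes so commas
  // inside titles don't break the columns
  const escape = (value) => `"${String(value).replace(/"/g, '""')}"`;

  const header = 'title,url';
  const rows = stories.map((s) => [s.title, s.url].map(escape).join(','));

  fs.writeFileSync(filename, [header, ...rows].join('\n'));
  console.log(`Saved ${stories.length} rows to ${filename}`);
}
```

Call saveToCsv(stories) at the end of scraper.js and you get a spreadsheet-ready file.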
This is the part most tutorials skip. Scraping without exporting is like fishing without a cooler.
Scraping a dynamic website with Puppeteer
Here’s where things get interesting. Many modern websites render content with JavaScript after the page loads — React, Vue, Angular apps, infinite scroll feeds, pages behind login forms. Cheerio sees an empty shell. You need a real browser.
Puppeteer launches a Chromium instance, navigates to the page, waits for JavaScript to run, and then gives you access to the fully rendered DOM.
Step 1: Install Puppeteer
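```bash
npm install puppeteer
```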
This downloads Chromium automatically (~200MB). Be patient.
Step 2: Scrape a JavaScript-rendered page
Let’s scrape quotes.toscrape.com/js — a practice site that only renders its quotes via JavaScript. If you View Page Source, you’ll see an empty `<div>` and a `<script>` tag. Cheerio would get nothing.
Create dynamic-scraper.js:
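A sketch using the site's `.quote`, `.text`, and `.author` classes (its markup at the time of writing):

```js
const puppeteer = require('puppeteer');

async function scrape() {
  // Launches a real (headless) Chromium instance
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wait until network activity settles, i.e. the JS has rendered
  await page.goto('https://quotes.toscrape.com/js/', { waitUntil: 'networkidle0' });
  await page.waitForSelector('.quote');

  // This callback runs inside the browser, just like a DevTools snippet
  const quotes = await page.$$eval('.quote', (els) =>
    els.map((el) => ({
      text: el.querySelector('.text').innerText,
      author: el.querySelector('.author').innerText,
    }))
  );

  console.log(quotes);
  await browser.close();
}

scrape();
```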
Step 3: Run it
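```bash
node dynamic-scraper.js
```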
The output:
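(First entries at the time of writing:)

```
[
  {
    text: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
    author: 'Albert Einstein'
  },
  {
    text: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
    author: 'J.K. Rowling'
  },
  ...
]
```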
Puppeteer waited for JavaScript to render the quotes, then extracted them just like you’d do in DevTools.
Cheerio vs Puppeteer — when to use which
| Factor | Cheerio | Puppeteer |
|---|---|---|
| Speed | Very fast (~50ms) | Slower (~2-5s) |
| Memory usage | ~30MB | ~300MB+ |
| JavaScript rendering | No | Yes |
| Form interaction | No | Yes |
| Screenshots | No | Yes |
| Best for | Blogs, docs, static pages | SPAs, dashboards, dynamic content |
Handling pagination with Puppeteer
Real scraping almost always involves multiple pages. Here’s how to scrape through paginated content:
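A sketch against the same practice site (its "Next" link is `li.next > a`; swap in your target's selectors):

```js
const puppeteer = require('puppeteer');

// Promise wrapper around setTimeout so we can await a polite pause
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeAllPages() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const allQuotes = [];

  let url = 'https://quotes.toscrape.com/js/';
  while (url) {
    await page.goto(url, { waitUntil: 'networkidle0' });
    await page.waitForSelector('.quote');

    const quotes = await page.$$eval('.quote', (els) =>
      els.map((el) => ({
        text: el.querySelector('.text').innerText,
        author: el.querySelector('.author').innerText,
      }))
    );
    allQuotes.push(...quotes);

    // Follow the "Next" link if there is one, otherwise stop
    url = await page.$eval('.next > a', (a) => a.href).catch(() => null);

    // Be polite: pause between pages so we don't hammer the server
    if (url) await wait(2000);
  }

  await browser.close();
  console.log(`Scraped ${allQuotes.length} quotes total`);
}

scrapeAllPages();
```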
That setTimeout between pages is not optional. Hammering a server with rapid-fire requests is how you get your IP banned — a lesson every developer learns the first time they scrape a website without any throttling. I’ll talk more about that in a moment.
Adding retries and error handling
The tutorials above work on practice sites. Real websites are less cooperative. Here’s a production-ready request wrapper that handles the three things that will absolutely break your Node.js web scraper:
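This version uses Node's built-in fetch; the backoff numbers are sensible defaults, not gospel:

```js
async function fetchWithRetry(url, maxRetries = 3) {
  const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, {
        headers: {
          // Some sites block requests without a browser-like User-Agent
          'User-Agent':
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
            '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        },
      });

      // Rate limited: back off exponentially (2s, 4s, 8s, ...)
      if (response.status === 429) {
        const delay = 2000 * 2 ** (attempt - 1);
        console.warn(`Rate limited. Waiting ${delay}ms (attempt ${attempt})`);
        await wait(delay);
        continue;
      }

      // Treat 5xx as transient and retry via the catch block below
      if (response.status >= 500) {
        throw new Error(`Server error: ${response.status}`);
      }

      return await response.text();
    } catch (err) {
      if (attempt === maxRetries) throw err;
      console.warn(`Attempt ${attempt} failed (${err.message}), retrying...`);
      await wait(1000 * attempt);
    }
  }

  throw new Error(`Gave up on ${url} after ${maxRetries} attempts`);
}
```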
This handles:
- Rate limiting (429): Exponential backoff — wait longer each time
- Server errors: Retry up to 3 times before giving up
- User-Agent headers: Some sites block requests without a browser-like User-Agent
It’s basic, but it’ll save you hours of debugging mysterious failures.
Pro tip: respect robots.txt
Before scraping any site, check https://example.com/robots.txt. It tells you which paths are off-limits to crawlers. Ignoring it won't crash your code, but it can get your IP permanently banned — and in some jurisdictions, it has legal implications.
When your scraper breaks (and it will)
I promised you the honest version. So here’s the reality: your scraper will break. Not “might” — will.
The four things that will kill your scraper
HTML structure changes
That .product-price selector that worked yesterday? The site just redesigned and now it's .price-display__current. Your scraper returns empty arrays. You can use data-testid attributes for more resilient selectors, but you're building on someone else's foundation — and they don't care.
Anti-bot detection
Amazon, Google, LinkedIn — they all actively block scrapers. They check headers, request patterns, JavaScript execution, datacenter IPs, and browser fingerprints. A basic Puppeteer scraper trips at least three of those signals. Getting past modern anti-bot systems is a full-time infrastructure job.
Rate limiting and IP bans
Even friendly sites block you if you send too many requests too fast. Some do it silently — returning stale data or redirecting to a CAPTCHA without changing the HTTP status code. Others hard-ban your IP for days.
Scale problems
Scraping 100 pages? Fine. Scraping 100,000 pages daily? You need proxy rotation, queue management, data deduplication, monitoring for broken selectors, and servers for headless browsers (each Puppeteer instance eats ~300MB RAM).
The hidden cost of DIY scraping at scale
| Cost | Typical range | What it covers |
|---|---|---|
| Proxies | $50–500/month | Residential proxies for anti-bot evasion |
| Servers | $20–200/month | VPS for running headless browsers 24/7 |
| Maintenance time | 5–15 hours/month | Fixing broken selectors and blocked IPs |
| CAPTCHA solving | $1–3 per 1,000 | Third-party CAPTCHA services |
I’m not saying this to scare you off. Scraping is a genuine skill and sometimes it’s the only option. But you should know the real cost before you commit to maintaining a scraping pipeline in production.
When to use an API instead of scraping
Here’s what I wish someone had told me before I spent that weekend building scrapers: if someone already built and maintains the scraping infrastructure for you, just use their API.
Think about what a scraping API does:
- Handles proxy rotation and IP management
- Solves CAPTCHAs automatically
- Adapts when the target site changes their HTML
- Returns clean, structured JSON instead of raw HTML you have to parse
- Scales without you managing servers
Compare what scraping Google search results looks like with Puppeteer vs. using the FlyByAPIs Google Search API:
DIY scraping vs API — same data, different effort
| Scraping Google yourself | Using a search API |
|---|---|
| Launch headless browser (~300MB RAM) | One HTTP request |
| Handle Google's anti-bot (reCAPTCHA) | Get structured JSON back |
| Rotate residential proxies | Organic results, PAA, snippets included |
| Parse the ever-changing HTML | 250 countries, 150 languages |
| Handle rate limits and retries | No proxy management |
| Maintain when Google updates layout | Maintained by someone else |
| ~200 lines of code + infrastructure | ~10 lines of code, zero infrastructure |
Here’s the API version — 10 lines that replace 200+ lines of scraping code:
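Something like the sketch below. The endpoint path and parameter names are my placeholders, not FlyByAPIs' documented interface, so check their docs for the exact request format (run it as an ES module so top-level await works):

```js
// NOTE: endpoint and parameter names here are illustrative assumptions;
// consult the FlyByAPIs docs for the real request format
const params = new URLSearchParams({
  q: 'best wireless headphones',
  gl: 'us',                           // country code (assumed parameter name)
  api_key: process.env.FLYBY_API_KEY, // hypothetical env var for your key
});

const res = await fetch(`https://api.flybyapis.com/v1/google/search?${params}`);
const data = await res.json();

console.log(data.organic_results); // structured results, no HTML parsing
```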
That’s it. No Puppeteer, no proxy rotation, no CAPTCHA solving. You get structured data — titles, URLs, snippets, People Also Ask, featured snippets — all as clean JSON.
FlyByAPIs Google Search API covers 250 countries and 150 languages. Organic results, knowledge panels, and People Also Ask data in a single request that takes under 2 seconds.
The same logic applies to Amazon product data. Scraping Amazon is particularly painful — they have some of the most aggressive anti-bot systems on the internet.
I wrote a full guide on scraping Amazon with Python, and in that post I show how the Amazon Product Data API returns the same data with a single request across all 22 marketplaces.
The decision framework
Not every scraping job needs an API. Here’s how I decide:
| Situation | Best approach | Why |
|---|---|---|
| One-time data grab from a simple site | Cheerio scraper | Quick, free, no maintenance needed |
| Scraping a JS-heavy site once | Puppeteer scraper | Gets the job done, toss the script after |
| Daily data from Google/Amazon/Maps | Google Search data API | Anti-bot systems make DIY unsustainable |
| Production pipeline, 10K+ requests/day | Amazon scraping API | Proxy + server costs exceed API pricing |
| Internal company tool or intranet | Custom scraper | No anti-bot, you control the source |
The breakeven point is roughly this: if you’re going to scrape the same major website more than once a week, an API will save you time and money within the first month.
The Google Search API starts at $0/month with 200 free requests — enough to prototype before committing.
For things like Google Maps data extraction, the calculation is even clearer. Google Maps is a fully client-rendered app — Cheerio is useless and Puppeteer needs constant babysitting to handle their authentication prompts and dynamic loading patterns.
Best practices for web scraping with Node.js
Whether you scrape websites as a one-off or run a production pipeline, these rules will keep your Node.js scrapers reliable:
1. Add delays between requests. At minimum, 1-2 seconds. Some sites need 3-5 seconds. Randomize the delay so your pattern doesn’t look automated.
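For example, a randomized 1-3 second pause:

```js
// Sleep for a random 1-3 seconds so the request pattern looks less robotic
const randomDelay = () =>
  new Promise((resolve) => setTimeout(resolve, 1000 + Math.random() * 2000));

await randomDelay(); // call this between every request
```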
2. Set realistic headers. At minimum, include a User-Agent that looks like a real browser. Better yet, rotate through a few different ones.
3. Handle errors gracefully. Network requests fail. Pages change. Selectors break. Your scraper should log the error, skip the page, and keep going — not crash on the first 404.
4. Check robots.txt first. It’s the website’s stated policy on what’s fair game for automated access. Respecting it isn’t just polite — it’s often legally relevant.
When scraping isn't worth the effort
For high-value data sources like Crunchbase company profiles, the site has aggressive anti-scraping measures and complex JavaScript rendering. A dedicated Crunchbase API saves weeks of reverse-engineering their protection layers.
5. Cache aggressively. If you’re developing and testing your scraper, save the HTML locally after the first fetch. Parse from the cached file instead of hitting the server every time you tweak a selector.
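A minimal sketch:

```js
const fs = require('fs');

// During development, hit the network once and reuse the saved HTML after that
async function getHtml(url, cacheFile = 'page-cache.html') {
  if (fs.existsSync(cacheFile)) {
    return fs.readFileSync(cacheFile, 'utf8');
  }
  const response = await fetch(url);
  const html = await response.text();
  fs.writeFileSync(cacheFile, html);
  return html;
}
```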
6. Know when to stop. If you’re spending more time maintaining your scraper than actually using the data, it’s time to switch to a purpose-built data API. I’ve been there. The sunk cost fallacy is real.
Quick reference: the complete toolkit
Here’s everything you need in one place:
| Tool | Install | Use for |
|---|---|---|
| Cheerio | npm install cheerio | Parsing static HTML, fast and lightweight |
| Puppeteer | npm install puppeteer | Dynamic sites, JS rendering, screenshots |
| Playwright | npm install playwright | Cross-browser alternative to Puppeteer |
| FlyByAPIs | No install — HTTP API | Production data from Google, Amazon, Maps |
Wrapping up
You now know how to scrape any website with Node.js — static pages with Cheerio, dynamic pages with Puppeteer, and how to handle the errors and blocks that inevitably come.
The honest truth: scraping is a fantastic skill for one-off data grabs, prototyping, and understanding how the web works under the hood. I still write quick scrapers all the time.
But for anything running in production — anything where you need the data to show up reliably tomorrow and next month and six months from now — the maintenance cost adds up fast. DIY scraping at scale costs $70-700/month in proxies and servers alone, before counting 5-15 hours of monthly maintenance.
That’s why we built APIs like our Google Search API, Amazon data API, and Google Maps scraping API — so you can spend time building your product instead of babysitting scrapers.
If you want to see the difference, the free tier gets you 200 requests/month with no credit card. Try scraping Google results yourself, then try the API. The comparison sells itself.
Now go build something.
Oriol.
