How to Scrape Any Website with Node.js (Even If You've Never Done It)

Learn to scrape any website with Node.js using Cheerio and Puppeteer. Step-by-step tutorial with working code, plus when to use an API instead.

I spent a weekend building a Node.js scraper to pull product prices from five different e-commerce sites. Beautiful code. Clean selectors. Retry logic. The whole thing.

It worked perfectly for 11 days.

Then two sites changed their HTML structure, one started returning CAPTCHAs, and another rate-limited my IP. I spent more time fixing the scraper than I ever spent building it.

Approaches covered: Cheerio (for static sites) and Puppeteer (for dynamic sites). Time to your first scrape: about 15 minutes.

That’s the thing nobody tells you about web scraping: writing the scraper is the easy part. Keeping it alive is where the real work begins.

Node.js web scraping is the process of programmatically extracting data from websites using JavaScript libraries like Cheerio (for static HTML) and Puppeteer (for JavaScript-rendered pages). It’s one of the most common data collection techniques in 2026 — though production-grade scraping increasingly relies on managed APIs to handle the hard parts.

But you need to learn it anyway. Understanding how scraping works makes you a better developer, and sometimes a quick scraper is exactly the right tool.

So here’s the honest tutorial: how to scrape any website with Node.js, with real code that actually works, what happens when it breaks, and when you should reach for an API instead.

TL;DR: Cheerio scrapes static pages in ~50ms with 30MB of memory. Puppeteer handles JavaScript-rendered sites but needs ~300MB RAM and 2-5 seconds per page. For production workloads against sites with anti-bot protection (Google, Amazon, Maps), a data API like FlyByAPIs replaces 200+ lines of scraping code with a single HTTP request — starting at $0/month with 200 free requests.


What you need before we start

You need Node.js 18+ installed. That’s it. Check with:

node --version

If you see v18 or higher, you’re good. If not, grab the latest LTS from nodejs.org.

Now create a project:

mkdir my-scraper && cd my-scraper
npm init -y

We’ll install specific libraries as we go. First up: Cheerio for static websites, then Puppeteer for the tricky dynamic ones.

Static vs. dynamic — the one decision that matters

Right-click any webpage → View Page Source. If you can see the data you want in the raw HTML, it's static → use Cheerio. If the source is mostly empty <div id="root"></div> tags and JavaScript bundles, it's dynamic → use Puppeteer.
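If you'd rather check in code, one quick heuristic is to fetch the raw HTML and test whether the data you're after is already in it. A minimal sketch, assuming you know a string that should appear on the rendered page:

async function isStatic(url, expected) {
  // If the string shows up in the raw HTML, the page is (probably) static enough for Cheerio
  const html = await (await fetch(url)).text();
  return html.includes(expected);
}

// Usage: isStatic("https://news.ycombinator.com", "Hacker News").then(console.log); // true: use Cheerio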


Scraping a static website with Cheerio

Cheerio is the jQuery of the server side. It parses HTML and lets you navigate the DOM with CSS selectors — fast, lightweight, and no browser needed.

Step 1: Install Cheerio

npm install cheerio

Step 2: Fetch and parse a page

Let’s scrape the Hacker News front page — it’s server-rendered HTML, perfect for Cheerio.

Create a file called scraper.js:

const cheerio = require("cheerio");

async function scrapeHackerNews() {
  const response = await fetch("https://news.ycombinator.com");
  const html = await response.text();
  const $ = cheerio.load(html);

  const stories = [];

  $(".athing").each((i, element) => {
    const title = $(element).find(".titleline > a").text();
    const url = $(element).find(".titleline > a").attr("href");
    const rank = $(element).find(".rank").text().replace(".", "");

    // HN puts points and author in the sibling row that follows each .athing row
    const subtext = $(element).next();
    const points = subtext.find(".score").text();
    const author = subtext.find(".hnuser").text();

    stories.push({ rank, title, url, points, author });
  });

  return stories;
}

scrapeHackerNews().then((stories) => {
  console.log(`Found ${stories.length} stories:\n`);
  stories.slice(0, 5).forEach((s) => {
    console.log(`#${s.rank} ${s.title}`);
    console.log(`   ${s.points} by ${s.author}`);
    console.log(`   ${s.url}\n`);
  });
});

Step 3: Run it

node scraper.js

You should see something like:

Found 30 stories:

#1 Show HN: I built a tool to visualize Git history
   142 points by devuser
   https://github.com/example/repo

#2 The hidden cost of microservices
   89 points by techwriter
   https://blog.example.com/microservices

That’s your first Node.js web scraper in 15 lines of actual code.

For production use cases — like pulling search rankings daily — you’ll eventually want a SERP data extraction API instead. But learning to scrape websites by hand first is how you understand what those APIs abstract away.

Why Cheerio is fast

Cheerio doesn't launch a browser. It just parses HTML strings — no CSS rendering, no JavaScript execution, no images loaded. That makes it 10-20x faster than headless browser approaches and uses a fraction of the memory. For static pages, there's no reason to reach for anything heavier.

Step 4: Export to CSV

Scraped data sitting in your terminal isn’t useful. Let’s save it:

const fs = require("fs");

function toCSV(stories) {
  const header = "rank,title,url,points,author";
  const rows = stories.map(
    (s) =>
      `${s.rank},"${s.title.replace(/"/g, '""')}","${s.url}","${s.points}","${s.author}"`
  );
  return [header, ...rows].join("\n");
}

// After scraping:
scrapeHackerNews().then((stories) => {
  fs.writeFileSync("stories.csv", toCSV(stories));
  console.log(`Saved ${stories.length} stories to stories.csv`);
});

This is the part most tutorials skip. Scraping without exporting is like fishing without a cooler.


Scraping a dynamic website with Puppeteer

Here’s where things get interesting. Many modern websites render content with JavaScript after the page loads — React, Vue, Angular apps, infinite scroll feeds, pages behind login forms. Cheerio sees an empty shell. You need a real browser.

Puppeteer launches a Chromium instance, navigates to the page, waits for JavaScript to run, and then gives you access to the fully rendered DOM.

Step 1: Install Puppeteer

npm install puppeteer

This downloads Chromium automatically (~200MB). Be patient.

Step 2: Scrape a JavaScript-rendered page

Let’s scrape quotes.toscrape.com/js — a practice site that only renders its quotes via JavaScript. If you View Page Source, you’ll see an empty <div> and a <script> tag. Cheerio would get nothing.

Create dynamic-scraper.js:

const puppeteer = require("puppeteer");

async function scrapeQuotes() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto("https://quotes.toscrape.com/js/", {
    waitUntil: "networkidle2",
  });

  // page.evaluate runs this callback inside the browser page, with full DOM access
  const quotes = await page.evaluate(() => {
    return Array.from(document.querySelectorAll(".quote")).map((el) => ({
      text: el.querySelector(".text").textContent,
      author: el.querySelector(".author").textContent,
      tags: Array.from(el.querySelectorAll(".tag")).map((t) => t.textContent),
    }));
  });

  await browser.close();
  return quotes;
}

scrapeQuotes().then((quotes) => {
  console.log(`Found ${quotes.length} quotes:\n`);
  quotes.slice(0, 3).forEach((q) => {
    console.log(`"${q.text}"`);
    console.log(`  — ${q.author} [${q.tags.join(", ")}]\n`);
  });
});

Step 3: Run it

node dynamic-scraper.js

The output:

Found 10 quotes:

""The world as we have created it is a process of our thinking.
It cannot be changed without changing our thinking.""
  — Albert Einstein [change, deep-thoughts, thinking, world]

""It is our choices, Harry, that show what we truly are,
far more than our abilities.""
  — J.K. Rowling [abilities, choices]

Puppeteer waited for JavaScript to render the quotes, then extracted them just like you’d do in DevTools.
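One refinement worth knowing: networkidle2 waits for network traffic to quiet down, which can be slow or flaky on chatty pages. Waiting for the exact element you need is often more reliable. Inside scrapeQuotes, you could swap the goto options for an explicit wait, roughly like this:

  // Wait until the first .quote element actually exists, instead of waiting for network idle
  await page.goto("https://quotes.toscrape.com/js/");
  await page.waitForSelector(".quote", { timeout: 10000 });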

Cheerio vs Puppeteer — when to use which

Factor               | Cheerio                   | Puppeteer
Speed                | Very fast (~50ms)         | Slower (~2-5s)
Memory usage         | ~30MB                     | ~300MB+
JavaScript rendering | No                        | Yes
Form interaction     | No                        | Yes
Screenshots          | No                        | Yes
Best for             | Blogs, docs, static pages | SPAs, dashboards, dynamic content

Handling pagination with Puppeteer

Real scraping almost always involves multiple pages. Here’s how to scrape through paginated content:

async function scrapeAllPages() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const allQuotes = [];

  let pageNum = 1;
  let hasNext = true;

  while (hasNext) {
    await page.goto(`https://quotes.toscrape.com/js/page/${pageNum}/`, {
      waitUntil: "networkidle2",
    });

    const quotes = await page.evaluate(() =>
      Array.from(document.querySelectorAll(".quote")).map((el) => ({
        text: el.querySelector(".text").textContent,
        author: el.querySelector(".author").textContent,
      }))
    );

    if (quotes.length === 0) {
      hasNext = false;
    } else {
      allQuotes.push(...quotes);
      console.log(`Page ${pageNum}: ${quotes.length} quotes`);
      pageNum++;
    }

    // Be polite — wait between requests
    await new Promise((r) => setTimeout(r, 1000));
  }

  await browser.close();
  console.log(`\nTotal: ${allQuotes.length} quotes from ${pageNum - 1} pages`);
  return allQuotes;
}

That setTimeout between pages is not optional. Hammering a server with rapid-fire requests is how you get your IP banned — a lesson every developer learns the first time they scrape a website without any throttling. I’ll talk more about that in a moment.


Adding retries and error handling

The tutorials above work on practice sites. Real websites are less cooperative. Here’s a production-ready request wrapper that handles the three things that will absolutely break your Node.js web scraper:

async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, {
        headers: {
          "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        },
      });

      if (response.status === 429) {
        const wait = Math.pow(2, attempt) * 1000;
        console.log(`Rate limited. Waiting ${wait / 1000}s...`);
        await new Promise((r) => setTimeout(r, wait));
        continue;
      }

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }

      return await response.text();
    } catch (error) {
      console.log(`Attempt ${attempt} failed: ${error.message}`);
      if (attempt === maxRetries) throw error;
      await new Promise((r) => setTimeout(r, 2000));
    }
  }
  // If every attempt was rate-limited, fail loudly instead of returning undefined
  throw new Error(`Failed to fetch ${url} after ${maxRetries} attempts`);
}

This handles:

  • Rate limiting (429): Exponential backoff — wait longer each time
  • Server errors: Retry up to 3 times before giving up
  • User-Agent headers: Some sites block requests without a browser-like User-Agent

It’s basic, but it’ll save you hours of debugging mysterious failures.

Pro tip: respect robots.txt

Before scraping any site, check https://example.com/robots.txt. It tells you which paths are off-limits to crawlers. Ignoring it won't crash your code, but it can get your IP permanently banned — and in some jurisdictions, it has legal implications.
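Here's a minimal sketch of such a check. Real robots.txt files also support Allow rules, wildcards, and per-bot sections, so treat this as a rough filter rather than a full parser:

// Naive robots.txt check: collect Disallow rules under "User-agent: *"
// and test whether a path falls under any of them. Not a complete parser.
async function isAllowed(baseUrl, path) {
  const res = await fetch(new URL("/robots.txt", baseUrl));
  if (!res.ok) return true; // no robots.txt found: assume allowed

  let appliesToAll = false;
  const disallowed = [];
  for (const line of (await res.text()).split("\n")) {
    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    if (/^user-agent$/i.test(field.trim())) appliesToAll = value === "*";
    else if (appliesToAll && /^disallow$/i.test(field.trim()) && value) disallowed.push(value);
  }
  return !disallowed.some((rule) => path.startsWith(rule));
}

// Usage: await isAllowed("https://example.com", "/search") // false if "Disallow: /search"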


When your scraper breaks (and it will)

I promised you the honest version. So here’s the reality: your scraper will break. Not “might” — will.

The four things that will kill your scraper

1. HTML structure changes

That .product-price selector that worked yesterday? The site just redesigned and now it's .price-display__current. Your scraper returns empty arrays. You can use data-testid attributes for more resilient selectors, but you're building on someone else's foundation — and they don't care.
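One defensive pattern: try a list of candidate selectors in order, so a redesign degrades to a warning instead of silently returning empty data. A sketch (the selectors below are hypothetical examples, not from a real site):

// Return the first selector that matches anything on the page, or null
function selectFirst($, candidates) {
  for (const selector of candidates) {
    const match = $(selector);
    if (match.length > 0) return match;
  }
  return null;
}

// Usage with Cheerio (hypothetical selectors, new and old):
// const price = selectFirst($, ['[data-testid="price"]', ".product-price", ".price-display__current"]);
// if (!price) console.warn("All price selectors broke; time to inspect the page again");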

2. Anti-bot detection

Amazon, Google, LinkedIn — they all actively block scrapers. They check headers, request patterns, JavaScript execution, datacenter IPs, and browser fingerprints. A basic Puppeteer scraper trips at least three of those signals. Getting past modern anti-bot systems is a full-time infrastructure job.

3. Rate limiting and IP bans

Even friendly sites block you if you send too many requests too fast. Some do it silently — returning stale data or redirecting to a CAPTCHA without changing the HTTP status code. Others hard-ban your IP for days.

4. Scale problems

Scraping 100 pages? Fine. Scraping 100,000 pages daily? You need proxy rotation, queue management, data deduplication, monitoring for broken selectors, and servers for headless browsers (each Puppeteer instance eats ~300MB RAM).
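To make "queue management" concrete, here's a minimal concurrency limiter. It's a sketch of the core idea only; production pipelines add retries, deduplication, and selector monitoring on top:

// Run scrapeOne over many URLs with at most `limit` requests in flight at once
async function scrapeWithConcurrency(urls, limit, scrapeOne) {
  const results = [];
  let index = 0;

  async function worker() {
    while (index < urls.length) {
      const url = urls[index++]; // safe: JS is single-threaded, no await before the increment
      try {
        results.push(await scrapeOne(url));
      } catch (error) {
        console.error(`Failed ${url}: ${error.message}`); // log and move on
      }
    }
  }

  await Promise.all(Array.from({ length: limit }, worker));
  return results;
}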

The hidden cost of DIY scraping at scale

  • Proxy costs: $50–500/month (residential proxies for anti-bot evasion)
  • Server costs: $20–200/month (VPS for running headless browsers 24/7)
  • Maintenance time: 5–15 hours/month (fixing broken selectors and blocked IPs)
  • CAPTCHA solving: $1–3 per 1,000 (third-party CAPTCHA services)

I’m not saying this to scare you off. Scraping is a genuine skill and sometimes it’s the only option. But you should know the real cost before you commit to maintaining a scraping pipeline in production.


When to use an API instead of scraping

Here’s what I wish someone had told me before I spent that weekend building scrapers: if someone already built and maintains the scraping infrastructure for you, just use their API.

Think about what a scraping API does:

  • Handles proxy rotation and IP management
  • Solves CAPTCHAs automatically
  • Adapts when the target site changes their HTML
  • Returns clean, structured JSON instead of raw HTML you have to parse
  • Scales without you managing servers

Compare what scraping Google search results looks like with Puppeteer vs. using the FlyByAPIs Google Search API:

DIY scraping vs API — same data, different effort

Scraping Google yourself

  • Launch headless browser (~300MB RAM)
  • Handle Google's anti-bot (reCAPTCHA)
  • Rotate residential proxies
  • Parse the ever-changing HTML
  • Handle rate limits and retries
  • Maintain when Google updates layout

~200 lines of code + infrastructure

Using a search API

  • One HTTP request
  • Get structured JSON back
  • Organic results, PAA, snippets included
  • 250 countries, 150 languages
  • No proxy management
  • Maintained by someone else

~10 lines of code, zero infrastructure

Here’s the API version — one short function that replaces 200+ lines of scraping code:

async function searchGoogle(query) {
  const response = await fetch(
    `https://google-serp-search-api.p.rapidapi.com/search?q=${encodeURIComponent(query)}&num=10`,
    {
      headers: {
        "x-rapidapi-key": "YOUR_API_KEY",
        "x-rapidapi-host": "google-serp-search-api.p.rapidapi.com",
      },
    }
  );

  const data = await response.json();
  return data.organic_results;
}

searchGoogle("best javascript frameworks 2026").then((results) => {
  results.forEach((r) => {
    console.log(`${r.position}. ${r.title}`);
    console.log(`   ${r.link}\n`);
  });
});

That’s it. No Puppeteer, no proxy rotation, no CAPTCHA solving. You get structured data — titles, URLs, snippets, People Also Ask, featured snippets — all as clean JSON.

FlyByAPIs Google Search API covers 250 countries and 150 languages. Organic results, knowledge panels, and People Also Ask data in a single request that takes under 2 seconds.

The same logic applies to Amazon product data. Scraping Amazon is particularly painful — they have some of the most aggressive anti-bot systems on the internet.

I wrote a full guide on scraping Amazon with Python, and in that post I show how the Amazon Product Data API returns the same data with a single request across all 22 marketplaces.

The decision framework

Not every scraping job needs an API. Here’s how I decide:

Situation                              | Best approach          | Why
One-time data grab from a simple site  | Cheerio scraper        | Quick, free, no maintenance needed
Scraping a JS-heavy site once          | Puppeteer scraper      | Gets the job done, toss the script after
Daily data from Google/Amazon/Maps     | Google Search data API | Anti-bot systems make DIY unsustainable
Production pipeline, 10K+ requests/day | Amazon scraping API    | Proxy + server costs exceed API pricing
Internal company tool or intranet      | Custom scraper         | No anti-bot, you control the source

The breakeven point is roughly this: if you’re going to scrape the same major website more than once a week, an API will save you time and money within the first month.

The Google Search API starts at $0/month with 200 free requests — enough to prototype before committing.

For things like Google Maps data extraction, the calculation is even clearer. Google Maps is a fully client-rendered app — Cheerio is useless and Puppeteer needs constant babysitting to handle their authentication prompts and dynamic loading patterns.


Best practices for web scraping with Node.js

Whether you scrape websites as a one-off or run a production pipeline, these rules will keep your Node.js scrapers reliable:

1. Add delays between requests. At minimum, 1-2 seconds. Some sites need 3-5 seconds. Randomize the delay so your pattern doesn’t look automated.

const delay = (ms) => new Promise((r) => setTimeout(r, ms));
await delay(1000 + Math.random() * 2000); // 1-3 seconds, randomized

2. Set realistic headers. At minimum, include a User-Agent that looks like a real browser. Better yet, rotate through a few different ones.
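A small rotation sketch (these User-Agent strings are illustrative examples and go stale, so refresh them now and then):

// A pool of desktop User-Agent strings to rotate through (illustrative examples)
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
];

const randomUserAgent = () =>
  USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

// Usage: fetch(url, { headers: { "User-Agent": randomUserAgent() } })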

3. Handle errors gracefully. Network requests fail. Pages change. Selectors break. Your scraper should log the error, skip the page, and keep going — not crash on the first 404.
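In practice that means isolating failures per page. A sketch that builds on the fetchWithRetry helper from earlier (parsePage here is a hypothetical stand-in for whatever parsing function you use):

// Skip failed pages instead of letting one bad URL kill the whole run
async function scrapeMany(urls) {
  const results = [];
  for (const url of urls) {
    try {
      const html = await fetchWithRetry(url);
      results.push({ url, data: parsePage(html) }); // parsePage is hypothetical
    } catch (error) {
      console.error(`Skipping ${url}: ${error.message}`);
    }
  }
  return results;
}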

4. Check robots.txt first. It’s the website’s stated policy on what’s fair game for automated access. Respecting it isn’t just polite — it’s often legally relevant.

When scraping isn't worth the effort

For high-value data sources like Crunchbase company profiles, the site has aggressive anti-scraping measures and complex JavaScript rendering. A dedicated Crunchbase API saves weeks of reverse-engineering their protection layers.

5. Cache aggressively. If you’re developing and testing your scraper, save the HTML locally after the first fetch. Parse from the cached file instead of hitting the server every time you tweak a selector.

const fs = require("fs");

async function fetchOrCache(url, cacheFile) {
  if (fs.existsSync(cacheFile)) {
    return fs.readFileSync(cacheFile, "utf-8");
  }
  const response = await fetch(url);
  const html = await response.text();
  fs.writeFileSync(cacheFile, html);
  return html;
}

6. Know when to stop. If you’re spending more time maintaining your scraper than actually using the data, it’s time to switch to a purpose-built data API. I’ve been there. The sunk cost fallacy is real.

Try the Google Search API free on RapidAPI →

200 requests/month free · No credit card required


Quick reference: the complete toolkit

Here’s everything you need in one place:

Tool       | Install                | Use for
Cheerio    | npm install cheerio    | Parsing static HTML, fast and lightweight
Puppeteer  | npm install puppeteer  | Dynamic sites, JS rendering, screenshots
Playwright | npm install playwright | Cross-browser alternative to Puppeteer
FlyByAPIs  | No install — HTTP API  | Production data from Google, Amazon, Maps

Wrapping up

You now know how to scrape any website with Node.js — static pages with Cheerio, dynamic pages with Puppeteer, and how to handle the errors and blocks that inevitably come.

The honest truth: scraping is a fantastic skill for one-off data grabs, prototyping, and understanding how the web works under the hood. I still write quick scrapers all the time.

But for anything running in production — anything where you need the data to show up reliably tomorrow and next month and six months from now — the maintenance cost adds up fast. DIY scraping at scale costs $70-700/month in proxies and servers alone, before counting 5-15 hours of monthly maintenance.

That’s why we built APIs like our Google Search API , Amazon data API , and Google Maps scraping API — so you can spend time building your product instead of babysitting scrapers.

If you want to see the difference, the free tier gets you 200 requests/month with no credit card. Try scraping Google results yourself, then try the API. The comparison sells itself.

Start scraping smarter — try FlyByAPIs free →

200 free requests/month · Structured JSON · No proxy headaches

Now go build something.

Oriol.

Frequently Asked Questions

Q What is the easiest way to scrape a website with Node.js?

For static websites, install Cheerio (npm install cheerio), fetch the page HTML, and use jQuery-style selectors to extract data in under 5 minutes. For JavaScript-rendered pages, use Puppeteer instead. For production scraping against anti-bot-protected sites like Google or Amazon, FlyByAPIs returns structured JSON without any parsing code.

Q Should I use Cheerio or Puppeteer for web scraping?

Use Cheerio when the data is in the page's HTML source (right-click → View Page Source) — it parses in ~50ms with ~30MB of memory. Use Puppeteer when content loads via JavaScript (SPAs, infinite scroll, login walls) — it needs ~300MB and 2-5 seconds per page. For recurring data needs from major platforms, FlyByAPIs handles both static and dynamic scraping server-side.

Q Is web scraping with Node.js legal?

Web scraping publicly available data is generally legal in most jurisdictions, but always check the website's Terms of Service and robots.txt. Scraping personal data, copyrighted content, or data behind authentication without permission can create legal issues. When in doubt, use an official API — like FlyByAPIs — which provides structured data with explicit permission.

Q How do I avoid getting blocked while scraping with Node.js?

Add delays between requests (1-3 seconds minimum), rotate User-Agent headers, respect robots.txt, and avoid hammering the same domain. For production scraping, you'll need proxy rotation, CAPTCHA solving, and retry logic — or use a data API like FlyByAPIs that handles all of this server-side.

Q Can I scrape Amazon product data with Node.js?

Yes, but Amazon has aggressive anti-bot protection that blocks most scrapers within minutes. You'd need rotating proxies, CAPTCHA solvers, and constant maintenance as Amazon updates their defenses. A faster alternative is FlyByAPIs Amazon Product Data API, which returns structured product data as JSON — prices, reviews, offers — across 22 marketplaces for $14.99/10K requests.

Q What is the best Node.js library for web scraping in 2026?

Cheerio is the top choice for static HTML parsing — it's fast, lightweight, and uses familiar jQuery syntax. For dynamic pages, Puppeteer remains the most popular headless browser library, though Playwright is gaining ground with better cross-browser support. For production use at scale, FlyByAPIs handles proxy rotation, anti-bot detection, and CAPTCHA solving across Google, Amazon, and Maps data.

Q How do I scrape data from a website that uses JavaScript rendering?

Use Puppeteer or Playwright — both launch a real Chromium browser that executes JavaScript, then extract data with page.evaluate() after calling waitForSelector(). This works for React, Vue, Angular, and any JavaScript-heavy site. For sites with aggressive anti-bot protection, FlyByAPIs runs the browser infrastructure server-side and returns clean JSON.

Q What is the difference between web scraping and using an API?

Web scraping extracts data from a website's HTML by simulating a browser visit. An API gives you the same data as structured JSON through a direct endpoint. Scraping requires maintaining code, handling blocks, and rotating proxies. APIs like FlyByAPIs Google Search API give you clean data with one HTTP request — no parsing, no proxy management, no maintenance.
Oriol Marti
Founder & CEO

Computer engineer and entrepreneur based in Andorra. Founder and CEO of FlyByAPIs, building reliable web data APIs for developers worldwide.

Free tier available

Ready to stop maintaining scrapers?

Production-ready APIs for web data extraction. Whatever you're building, up and running in minutes.

Start for free on RapidAPI