Most scraping tutorials reach for Python. I get it: BeautifulSoup is friendly and there are a thousand guides for it. But if you have ever watched a Python scraper crawl 50,000 pages one slow request at a time, you have probably wondered if there is something faster.
There is. A golang web scraper built with Colly will saturate your network long before it saturates a CPU core, and it ships as a single binary you can drop on any server.
I just built one to write this post. It crawls all 1,000 books across 50 pages of a sandbox site, extracts five fields each, and writes a clean CSV. End to end it runs in about 20 seconds and the core logic is under 60 lines of Go.
Build time, start to finish: < 30 min
Core scraper logic: ~60 lines
Books scraped, 50 pages: 1,000
No runtime to install: 1 binary
We run web data extraction infrastructure for a living, so I have opinions about where DIY scraping pays off and where it quietly becomes a second job. This post is the honest version: build the thing, ship it, and know exactly when to stop maintaining it.
By the end you will have a working scraper that handles requests, CSS selectors, pagination, concurrency, CSV output, and the anti-blocking basics. All the code runs. You can clone it.
In short: a Go web scraper is a program that fetches web pages and extracts structured data from them, and the standard tool for it is Colly. The scraper in this post crawls 1,000 books across 50 pages and writes a clean CSV in about 20 seconds, in under 60 lines of core logic. When the target fights back with bot defenses, FlyByAPIs runs the same job as a single HTTP call. The complete, runnable project is on GitHub: flybyapis/blog-web-scraping-code → golang-web-scraper. Clone it, run go run ./colly-scraper, and watch it work before you read another word.
Why Go is a good fit for web scraping
Scraping is mostly waiting. Your program fires a request, then sits idle while bytes travel across the internet. The faster you can run those waits in parallel, the faster the whole job finishes.
This is exactly what Go was built for. Goroutines make concurrency cheap, so a Go scraper can keep hundreds of requests in flight without the threading headaches you would hit elsewhere.
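To make that concrete, here is a minimal sketch that uses only the standard library. The fetchAll helper and the simulated 50 ms wait are illustrative assumptions standing in for real HTTP requests; they are not part of the scraper built later in this post.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchAll runs fetch once per URL, each call in its own goroutine, and
// returns the results in the same order as the input slice.
// fetch is a stand-in for a real HTTP request.
func fetchAll(urls []string, fetch func(string) string) []string {
	results := make([]string, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			results[i] = fetch(u) // each goroutine writes only its own slot
		}(i, u)
	}
	wg.Wait()
	return results
}

func main() {
	urls := make([]string, 100)
	for i := range urls {
		urls[i] = fmt.Sprintf("https://example.com/page-%d", i+1)
	}
	// Simulated network wait: 50 ms per "request".
	slowFetch := func(u string) string {
		time.Sleep(50 * time.Millisecond)
		return "fetched " + u
	}
	start := time.Now()
	pages := fetchAll(urls, slowFetch)
	fmt.Printf("%d pages in %v\n", len(pages), time.Since(start).Round(time.Millisecond))
}
```

Run one after another, those 100 simulated waits would add up to about five seconds. Run as goroutines, the whole batch should finish in roughly the time of a single wait, which is the effect the rest of this post relies on.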
Where Go wins
Native concurrency, low memory use, a single compiled binary, and fast HTML parsing. Great for large crawls and long-running services.
Where Go lags
Fewer scraping libraries than Python, and no first-class headless browser. JavaScript-heavy sites need extra tooling.
If you have done web scraping with Python before, the mental model carries over. The difference shows up at scale, when the same job that pinned a Python process barely registers in Go.
I am not here to start a language war. Python is excellent, and our Python web scraping guide walks through that side. Pick the language your team already runs. If that is Go, you are in good hands with Colly.
What you need before you start
Two things: a recent Go install and one library. That is the whole setup.
First, create a module for the project by running go mod init followed by a module path of your choice. This is just standard Go modules, nothing scraping-specific.
Then add Colly, the only dependency you need for HTML scraping. With the current v2 module path that is go get github.com/gocolly/colly/v2.
What is Colly?
Colly is a scraping framework for Go. You create a "collector," register callbacks for the HTML you care about, and tell it which URLs to visit. It handles requests, parsing, concurrency, caching, and proxies so you do not have to.
Our target is books.toscrape.com, a site the Scrapy team built specifically so people can practice without annoying anyone. It has a product grid, real prices, and pagination across 50 pages. Perfect for learning.
Building your golang web scraper with Colly
Here is the plan. We will build the scraper in five steps, each one adding a real capability. By the end you will have the full thing.
I will show the important pieces inline. The complete file lives in the golang-web-scraper repo so you are never copying half-finished snippets.
Step 1: Send your first request
Every Colly scraper starts the same way. You make a collector, attach a callback, and visit a URL.
Run it with go run . and you will see one request go out and the response come back. AllowedDomains is a small safety net: it stops the crawler from wandering off-site if a link points somewhere unexpected.
That is the whole rhythm of Colly. Register callbacks, then visit. Everything else is variations on this.
Step 2: Select the data with CSS selectors
Now the part you actually came for: pulling fields out of the HTML. Colly uses CSS selectors, the same ones you would use in the browser console.
Open the target page, right-click a book, and inspect it. Each book sits inside article.product_pod, with the title in an anchor, the price in p.price_color, and the rating encoded as a CSS class.
Pro tip: grab the title attribute, not the text
The visible h3 text on this site is truncated with an ellipsis. The full title lives in the anchor's title attribute, which is why we use ChildAttr("h3 a", "title") instead of ChildText.
ChildText grabs the text inside a selector. ChildAttr grabs an attribute. AbsoluteURL turns a relative href into a full link you can actually follow. Those three cover almost everything.
The rating is a small puzzle. The HTML stores it as class="star-rating Three", so you read the class string and take the second word. A two-line helper handles that in the full code.
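One way to write such a helper, using only the standard library. This is my own variant, not necessarily the helper in the repo: instead of returning the second word as-is, it maps the word to a number, and the name ratingFromClass is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// ratingFromClass converts a class string such as "star-rating Three"
// into a number from 1 to 5. It returns 0 when no rating word is found.
func ratingFromClass(class string) int {
	words := map[string]int{"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
	for _, w := range strings.Fields(class) {
		if n, ok := words[w]; ok {
			return n
		}
	}
	return 0
}

func main() {
	fmt.Println(ratingFromClass("star-rating Three")) // prints 3
}
```

Scanning every word rather than indexing the second one means the helper does not panic if the class attribute is empty or the order ever changes.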
Step 3: Handle pagination and scrape every page
One page of books is not a dataset. The whole point of golang web scraping is to walk the entire catalog, which means following the “next” link until it runs out.
This is where Colly feels almost too easy. You register a callback for the pagination link and tell it to visit whatever it finds.
That is it. Colly sees the next link on page one, visits page two, finds the next link there, and keeps going until there is no li.next left. Fifty pages, zero manual URL building.
Why this works
Colly remembers which URLs it has already visited and skips repeats. Each Visit call fetches the linked page and runs your callbacks on it, including the pagination callback. You are describing the link graph, and Colly walks it for you.
Step 4: Scrape pages concurrently
Here is where Go earns its keep. Crawling 50 pages one at a time is slow because most of that time is spent waiting on the network. Run them in parallel and the job collapses to a fraction of the time.
Turn on async mode when you create the collector, then add a limit rule so you stay polite:
Two things change. Async(true) makes Visit return immediately instead of blocking, and c.Wait() at the end holds the program open until the crawl drains.
Concurrency means shared state needs a mutex
With async on, your OnHTML callback runs from several goroutines at once. If they all append to the same slice, you will get a data race. Wrap the append in a sync.Mutex or you will lose books and corrupt the results.
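The pattern looks like this in isolation. This is a standard-library sketch with plain goroutines standing in for Colly's callbacks; the bookStore type and collect function are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Book is trimmed to one field to keep the sketch short.
type Book struct{ Title string }

// bookStore is a slice of books guarded by a mutex.
type bookStore struct {
	mu    sync.Mutex
	books []Book
}

// add appends a book while holding the lock, so concurrent callers
// cannot corrupt the slice.
func (s *bookStore) add(b Book) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.books = append(s.books, b)
}

// size returns the number of stored books.
func (s *bookStore) size() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.books)
}

// collect simulates n concurrent callbacks, each adding one book.
func collect(n int) int {
	var store bookStore
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			store.add(Book{Title: fmt.Sprintf("book %d", i)})
		}(i)
	}
	wg.Wait()
	return store.size()
}

func main() {
	fmt.Println(collect(1000)) // prints 1000
}
```

Inside a Colly OnHTML callback you would call store.add in the same way. Running with go run -race is a quick way to confirm no unguarded access is left.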
The RandomDelay matters more than it looks. Firing requests at full speed with no gap is the single most obvious bot signal there is. A small random delay makes the traffic look human and keeps you off block lists.
Step 5: Save the results to CSV
Scraped data that lives in memory and vanishes when the program exits is not useful. Let us write it to a CSV with the standard library, no extra packages.
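A sketch of the CSV step using encoding/csv. The Book fields, the column names, the books.csv file name, and the sample row are assumptions for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

// Book mirrors the five fields scraped per product (assumed field set).
type Book struct {
	Title, Price, Rating, Availability, URL string
}

// writeCSV writes a header row followed by one row per book.
func writeCSV(w io.Writer, books []Book) error {
	cw := csv.NewWriter(w)
	if err := cw.Write([]string{"title", "price", "rating", "availability", "url"}); err != nil {
		return err
	}
	for _, b := range books {
		if err := cw.Write([]string{b.Title, b.Price, b.Rating, b.Availability, b.URL}); err != nil {
			return err
		}
	}
	cw.Flush() // push buffered rows to the underlying writer
	return cw.Error()
}

// csvString renders the books to an in-memory CSV string.
func csvString(books []Book) (string, error) {
	var sb strings.Builder
	err := writeCSV(&sb, books)
	return sb.String(), err
}

func main() {
	// Placeholder row; in the scraper this slice is filled during the crawl.
	books := []Book{{"Example, with a comma", "£10.00", "3", "In stock", "https://example.com/book"}}
	f, err := os.Create("books.csv")
	if err != nil {
		log.Printf("create file: %v", err)
		return
	}
	defer f.Close()
	if err := writeCSV(f, books); err != nil {
		log.Printf("write csv: %v", err)
		return
	}
	fmt.Println("wrote", len(books), "row(s) to books.csv")
}
```

Taking an io.Writer rather than a file name keeps the function easy to test, and the csv package quotes fields that contain commas, so titles with punctuation survive the round trip.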
Collect every book into a slice during the crawl, then write the slice after c.Wait() returns. Open the file in any spreadsheet and there is your dataset.
If a spreadsheet is your actual destination, I wrote a whole post on getting scraped data into Excel cleanly that covers the formatting traps.
Pages crawled: 50
Rows in the CSV: 1,000
Total run time: ~20s
That is a complete, working scraper. Five steps, and you can pull a full catalog into a CSV. The full version on GitHub wires all of this together with flags, a mutex, and the hardening we are about to add.
Making your Go scraper production-ready
The sandbox site is friendly. Real sites are not. The moment you point a scraper at a site that does not want to be scraped, you hit rate limits, bot detection, and IP bans.
Here is the hardening that actually moves the needle, all of it built into Colly.
Production checklist
Identity: rotate the User-Agent (extensions.RandomUserAgent(c)). ✓ Avoids the default, non-browser UA
Pace: limit rate + random delay (colly.LimitRule{...}). ✓ Looks human, avoids bans
Resilience: retry with backoff (c.OnError + Retry()). ✓ Survives flaky responses
Scale: rotate proxies (proxy.RoundRobinProxySwitcher). ✓ Spreads load across IPs
Rotate your User-Agent
The fastest way to get blocked is to send the same default User-Agent on every request. Colly's default names the library rather than a browser, which screams “bot.” Colly ships an extension that rotates through real browser strings:
One line, and every request now looks like it came from a different browser.
Retry failed requests with backoff
Networks are flaky. A request that fails once often succeeds on the second try, so do not let a single timeout kill your crawl. Catch errors and retry with an increasing delay:
The backoff grows with each attempt, so you are not pounding a struggling server. After three tries it gives up and moves on instead of hanging forever.
Add rotating proxies when you scale
One IP making thousands of requests is a pattern any defense will catch. Spread the load across a pool of proxies and each one looks like a normal visitor:
The honest part about proxies
Good residential proxies cost real money, often hundreds of dollars a month at volume. This is the line where DIY scraping stops being free. Keep that number in mind for the next section.
Put all four together and you have a scraper that survives contact with a real website. The complete hardened version bundles every one of these with command-line flags.
Colly vs chromedp: when you need a headless browser
Colly has one hard limit. It reads the raw HTML the server sends, and nothing more. If a site builds its content with JavaScript after the page loads, Colly sees an empty shell.
You can test this in seconds. Scrape the page, print the body, and if the data you want is missing from the raw HTML, it is being rendered client-side.
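That check can be sketched with the standard library alone. The fetchRawHTML and hasData helpers are hypothetical names, and the product_pod marker is just the class used earlier in this post; pick any string you expect to find in your own target's data.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
	"time"
)

// fetchRawHTML downloads a page exactly as the server sends it,
// with no JavaScript executed.
func fetchRawHTML(url string) (string, error) {
	client := &http.Client{Timeout: 15 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// hasData reports whether the marker appears in the raw HTML.
func hasData(html, marker string) bool {
	return strings.Contains(html, marker)
}

func main() {
	html, err := fetchRawHTML("https://books.toscrape.com/")
	if err != nil {
		log.Printf("fetch failed: %v", err)
		return
	}
	if hasData(html, "product_pod") {
		fmt.Println("data is in the page source: Colly is enough")
	} else {
		fmt.Println("data is missing from the raw HTML: likely rendered client-side")
	}
}
```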
Use Colly when
The data is in the page source. Server-rendered sites, classic HTML, most blogs, catalogs, and listings. Fast and cheap.
Reach for chromedp when
Content loads via JavaScript. Single-page apps, infinite scroll, data that appears only after interaction. Slower and heavier.
chromedp drives a real headless Chrome from Go, so it executes JavaScript exactly like a browser. The cost is speed and memory: you are running an actual browser per worker, which does not scale the way Colly does.
This is the same wall you hit in any language. Our Node.js scraping guide covers the Puppeteer equivalent, and the tradeoff is identical. Headless browsers are powerful and expensive, which is often the point where a hosted Go scraping API starts to look attractive.
When to stop scraping and use an API
I promised the honest version, so here it is. A scraper you build yourself is the right tool for plenty of jobs. It is the wrong tool for a few specific ones, and pretending otherwise wastes your time.
Building the scraper is the easy 20%. The other 80% is maintenance: proxies that get banned, layouts that change overnight, CAPTCHAs, and JavaScript walls. That work never ends.
The site has serious anti-bot defenses
Cloudflare, rotating tokens, fingerprinting. You will spend more time fighting the defense than using the data.
You are scraping Google, Amazon, or Maps
These targets fight back hard and change constantly. A maintained API is almost always cheaper than your time.
The data needs to be reliable
If a broken scraper means a broken product, you do not want a 2am page because a competitor redesigned their site.
This is the gap we built FlyByAPIs to close. Instead of maintaining proxies and parsers, you make one HTTP call and get clean JSON back. Same Go you already know.
Say you want Google search results. With a Go web scraping API for Google, the entire scraper is a single request:
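A sketch of what that single request can look like with net/http. Everything about the endpoint is a placeholder: the SEARCH_API_ENDPOINT and SEARCH_API_KEY environment variables, the q query parameter, and the X-API-Key header are hypothetical, so check the provider's documentation for the real values. Only the organic_results shape (title, link, description, position) comes from this post, and treating position as an integer is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
	"time"
)

// OrganicResult mirrors one entry of organic_results.
type OrganicResult struct {
	Position    int    `json:"position"`
	Title       string `json:"title"`
	Link        string `json:"link"`
	Description string `json:"description"`
}

// parseResults decodes the organic_results array from a JSON body.
func parseResults(data []byte) ([]OrganicResult, error) {
	var body struct {
		OrganicResults []OrganicResult `json:"organic_results"`
	}
	if err := json.Unmarshal(data, &body); err != nil {
		return nil, err
	}
	return body.OrganicResults, nil
}

// search sends one GET request. The query parameter and header names
// are hypothetical placeholders.
func search(endpoint, apiKey, query string) ([]OrganicResult, error) {
	req, err := http.NewRequest(http.MethodGet, endpoint+"?q="+url.QueryEscape(query), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-API-Key", apiKey)
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseResults(data)
}

func main() {
	endpoint, key := os.Getenv("SEARCH_API_ENDPOINT"), os.Getenv("SEARCH_API_KEY")
	if endpoint == "" || key == "" {
		log.Println("set SEARCH_API_ENDPOINT and SEARCH_API_KEY first")
		return
	}
	results, err := search(endpoint, key, "golang web scraper")
	if err != nil {
		log.Printf("search failed: %v", err)
		return
	}
	for _, r := range results {
		fmt.Printf("%d. %s (%s)\n", r.Position, r.Title, r.Link)
	}
}
```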
No proxies. No headless Chrome. No selector that breaks when Google ships a redesign. FlyByAPIs handles the IP rotation and parsing, so the response comes back as structured JSON with organic_results, each carrying title, link, description, and position. The runnable version is in the api-scraper folder of the repo.
The same idea covers the targets that punish DIY scrapers the most. There is an Amazon product data API for prices and listings, a Google Maps data extraction API for local business records, and a Crunchbase company data API for firmographics.
Need something else? There is a jobs search API for listings and a translation API for turning scraped content into other languages at scale.
Bottom line:
Build your own scraper for niche sites, internal tools, and learning. Use a managed API, such as a Google search API, for the hard targets where uptime and clean data matter more than control.
The right answer is usually both. DIY where you can, API where the maintenance cost outweighs the freedom. A good engineer knows which is which.
The complete code
Everything in this post is in one repo, tested and runnable: flybyapis/blog-web-scraping-code on GitHub, under golang-web-scraper.
The colly-scraper folder has the full hardened version with flags for output file, concurrency, retries, and proxies. The api-scraper folder shows the managed alternative.
Wrapping up
We started with a question: is there something faster than the usual Python scraper? There is, and you just built it. A golang web scraper with Colly handles requests, selectors, pagination, concurrency, and CSV output in under 60 lines, then crawls 1,000 records in about 20 seconds.
You also learned where the line is. Colly is brilliant for server-rendered sites and large crawls. It is the wrong tool for Cloudflare-protected targets and JavaScript walls, and for those a maintained Google Search API from FlyByAPIs saves you the part of scraping nobody enjoys.
The takeaway:
Reach for Colly when the data is in the page source and you control the crawl. Reach for a managed API when the target is Google, Amazon, or Maps, the defenses are serious, or the data has to stay reliable. The right answer is usually both.
Clone the repo, run it, break it, make it yours. Then point it at something real and see how far DIY takes you before the maintenance starts to bite.
What are you building with it? If you scrape Google, Amazon, or Maps and the bans are wearing you down, try the managed APIs free. Beats debugging proxies at 2am.
Oriol.
