How to Scrape News Websites at Scale in 2026

By Marcus Reiner · 2026-05-18 · 10 min read · Engineering


Pulling 100k articles/day across 5k publishers takes the right mix of RSS feeds, news sitemaps, and residential-proxy scraping.

RSS + sitemap first

Roughly 80% of news sites publish a full-text RSS feed or a news sitemap (news-sitemap.xml) that updates within minutes of publication. Crawling these is free, fast, and respectful, and the publishers actively want you to.
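
As a rough illustration of the sitemap path, here is a minimal sketch that reads a news sitemap with Python's standard library. It assumes the publisher uses the common sitemap and Google News extension namespaces; the function name `parse_news_sitemap` and the sample document are made up for the example, and a real crawler would fetch the XML over HTTP first.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the sitemap protocol and the Google News extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

# Hypothetical sample; a real sitemap would be downloaded from the publisher.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/politics/budget-vote</loc>
    <news:news>
      <news:publication>
        <news:name>Example Times</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2026-05-18T09:30:00Z</news:publication_date>
      <news:title>Budget vote passes</news:title>
    </news:news>
  </url>
</urlset>"""


def parse_news_sitemap(xml_text: str) -> list[dict]:
    """Return one dict per <url> entry: location, title, publication date."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "url": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "title": url.findtext(
                "news:news/news:title", default="", namespaces=NS
            ).strip(),
            "published": url.findtext(
                "news:news/news:publication_date", default="", namespaces=NS
            ).strip(),
        })
    return entries


for entry in parse_news_sitemap(SAMPLE):
    print(entry["published"], entry["url"])
```

Polling each sitemap every few minutes and queueing only URLs you have not seen before is usually enough to keep up with a publisher's output.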

Residential for the holdouts

Top-tier publishers (NYT, WSJ, FT, Bloomberg) restrict their RSS to headlines. For the full text of their free, open-web articles, residential proxies plus headless rendering work. Paywalled content requires a subscription; scraping it raises both CFAA and terms-of-service problems.

Stack at 100k/day

Decodo residential ($2/GB) for the open web

Bright Data SERP API for Google News discovery

newspaper3k or trafilatura for HTML→article extraction

Postgres + S3 for storage; deduplicate on content hash
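
The content-hash deduplication step can be sketched as below. This is one possible approach, not a prescribed one: the normalization rules (NFKC, lowercasing, whitespace collapsing) are assumptions, and the in-memory `Deduplicator` class is a hypothetical stand-in for what would more likely be a unique index on a hash column in Postgres.

```python
import hashlib
import re
import unicodedata


def content_hash(text: str) -> str:
    """SHA-256 of the article text after Unicode, case and whitespace normalization."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    """In-memory stand-in for a unique index on a content-hash column."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, text: str) -> bool:
        """Return True the first time a given (normalized) text is seen."""
        digest = content_hash(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


dedup = Deduplicator()
print(dedup.is_new("Budget vote  passes.\n"))  # first sighting
print(dedup.is_new("budget vote passes."))     # same text after normalization
```

An exact hash only catches copies that are identical after normalization. Syndicated wire stories with small edits would need a near-duplicate technique on top of this.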

Respect crawl-delay

Many news sites set a Crawl-delay directive in robots.txt. Honoring it keeps you off the bad-actor list and avoids legal complaints. Where no delay is specified, 1-2 req/s per host is a safe ceiling.
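
One way to wire this in is sketched below, using `urllib.robotparser` from the standard library to read the delay and a small per-host scheduler to enforce it. The `DEFAULT_DELAY` fallback, the `HostThrottle` class, and the `example-newsbot` user agent are all assumptions made for the example.

```python
import time
from typing import Optional
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 1.0  # seconds; assumed fallback of about 1 req/s per host


def delay_for(robots_txt: str, user_agent: str) -> float:
    """Return the Crawl-delay for user_agent, or DEFAULT_DELAY when none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY


class HostThrottle:
    """Tracks the earliest time each host may be requested again."""

    def __init__(self) -> None:
        self._next_allowed: dict[str, float] = {}

    def wait_time(self, host: str, delay: float, now: Optional[float] = None) -> float:
        """Return how long to sleep before requesting host, and reserve the slot."""
        now = time.monotonic() if now is None else now
        ready_at = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = ready_at + delay
        return ready_at - now


# Hypothetical robots.txt body; a real crawler would fetch it once per host.
ROBOTS = "User-agent: *\nCrawl-delay: 5\nDisallow: /search\n"

throttle = HostThrottle()
delay = delay_for(ROBOTS, "example-newsbot")
time.sleep(throttle.wait_time("example.com", delay))  # first request: no wait
```

Keeping the throttle keyed by host rather than global lets a crawler stay polite to each publisher while still working through thousands of hosts in parallel.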

Schema.org NewsArticle

Modern news sites embed NewsArticle JSON-LD with headline, author, datePublished, and often articleBody. Parse this first: it's stable, structured, and survives UI redesigns.
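
Here is a minimal sketch of pulling that JSON-LD out of a page with only the standard library. The helper names and the sample HTML are hypothetical, and the handling of `@graph` wrappers and list-valued `@type` reflects layouts seen in the wild rather than anything this post specifies.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _candidates(node):
    """Yield every dict in a JSON-LD payload, descending into lists and @graph."""
    if isinstance(node, list):
        for item in node:
            yield from _candidates(item)
    elif isinstance(node, dict):
        yield node
        yield from _candidates(node.get("@graph", []))


def _is_news_article(node: dict) -> bool:
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "NewsArticle" in types


def extract_news_article(html: str):
    """Return the first NewsArticle object found in the page, or None."""
    collector = _JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        for node in _candidates(payload):
            if _is_news_article(node):
                return node
    return None


# Hypothetical page used only to exercise the parser.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "NewsArticle",
 "headline": "Budget vote passes",
 "author": {"@type": "Person", "name": "Jane Doe"},
 "datePublished": "2026-05-18T09:30:00Z"}
</script>
</head><body><h1>Budget vote passes</h1></body></html>"""

article = extract_news_article(SAMPLE_HTML)
print(article["headline"], article["datePublished"])
```

When `articleBody` is missing, as in this sample, falling back to an HTML extractor such as trafilatura for the body while keeping the JSON-LD headline, author, and date is a reasonable split.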

FAQ

Can I republish full articles?

No. Republishing full articles without a license is copyright infringement. Fair-use snippets plus a link back to the source is the model news aggregators use.
