How to Scrape News Websites at Scale in 2026
By Marcus Reiner · 2026-05-18 · 10 min read · Engineering
Pulling 100k articles/day across 5k publishers takes the right mix of RSS feeds, news sitemaps, and residential-proxy scraping.
RSS + sitemap first
About 80% of news sites publish a full-text RSS feed or a news sitemap (news-sitemap.xml) that is updated within minutes of publication. Crawling these is free, fast, and respectful, and publishers actively want you to use them.
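As a rough illustration of the sitemap route, the sketch below parses a news sitemap with Python's standard library only. It assumes the publisher follows the sitemaps.org schema plus the Google News sitemap extension. The function name parse_news_sitemap, the output keys, and the sample document are made up for this example, and fetching the XML over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Namespaces of the sitemap protocol and the Google News sitemap extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

# Made-up sample; a real crawler would download this from the publisher.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/2026/05/18/story-one</loc>
    <news:news>
      <news:publication>
        <news:name>Example Times</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2026-05-18T09:00:00Z</news:publication_date>
      <news:title>Story one</news:title>
    </news:news>
  </url>
</urlset>
"""


def parse_news_sitemap(xml_text):
    """Return one dict per <url> entry with its URL, title and publication date."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        if not loc:
            continue  # skip malformed entries that carry no location
        entries.append({
            "url": loc,
            # These two are None when the publisher omits the news extension.
            "title": url.findtext("news:news/news:title", namespaces=NS),
            "published": url.findtext("news:news/news:publication_date", namespaces=NS),
        })
    return entries


if __name__ == "__main__":
    for entry in parse_news_sitemap(SAMPLE_SITEMAP):
        print(entry["published"], entry["url"])
```

Comparing each entry's published value against the last crawl time is one simple way to fetch only new articles.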
Residential for the holdouts
Top-tier publishers (NYT, WSJ, FT, Bloomberg) restrict their RSS feeds to headlines. For the full text of free articles on the open web, residential proxies plus headless rendering work. Paywalled content requires a subscription; scraping it raises Computer Fraud and Abuse Act (CFAA) and terms-of-service problems.
Stack at 100k/day
Decodo residential ($2/GB) for the open web
Bright Data SERP API for Google News discovery
newspaper3k or trafilatura for HTML→article extraction
Postgres + S3 for storage; deduplicate on content hash
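The "deduplicate on content hash" step in the stack above could look roughly like the following. This is a minimal sketch, not a description of any particular pipeline: the normalization rules (Unicode NFKC, lowercasing, whitespace collapsing) and the names content_hash and Deduplicator are assumptions. The in-memory set stands in for what would more likely be a unique index on a hash column in Postgres.

```python
import hashlib
import re
import unicodedata


def content_hash(text):
    """Hash article text after light normalization so trivial variants collide."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    """Tracks hashes seen so far (in memory, for illustration only)."""

    def __init__(self):
        self._seen = set()

    def is_new(self, text):
        """Return True the first time a given normalized text is seen."""
        digest = content_hash(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


if __name__ == "__main__":
    dedup = Deduplicator()
    print(dedup.is_new("Breaking:  markets rally\n"))  # first sighting
    print(dedup.is_new("breaking: markets rally"))     # same text after normalization
```

An exact hash only catches copies that are identical after normalization. Syndicated stories with edited headlines or appended boilerplate would need a near-duplicate technique, which is outside this sketch.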
Respect crawl-delay
Many news sites set a Crawl-delay in robots.txt. The directive is non-standard, but honoring it keeps you off the bad-actor list and avoids legal complaints. As a default, 1-2 requests per second per host is a safe ceiling.
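One way to honor these rules is sketched below, using urllib.robotparser from the standard library. The HostThrottle class, the "examplebot" user agent, and the one-second fallback delay are assumptions for illustration. Downloading robots.txt is omitted so the example runs offline. Note that the standard-library parser only recognizes whole-number Crawl-delay values.

```python
import time
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 1.0  # seconds between requests when robots.txt sets no Crawl-delay


def load_rules(robots_txt):
    """Parse the text of a robots.txt file that was fetched elsewhere."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def delay_for(parser, user_agent, default=DEFAULT_DELAY):
    """Return the Crawl-delay for this agent, or the fallback if none is set."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


class HostThrottle:
    """Spaces out requests per host; clock and sleep are injectable for testing."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, host, delay):
        """Block until the host may be hit again; return the seconds slept."""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        pause = max(0.0, ready_at - now)
        if pause:
            self._sleep(pause)
        self._next_allowed[host] = max(now, ready_at) + delay
        return pause


if __name__ == "__main__":
    rules = load_rules("User-agent: *\nCrawl-delay: 2\nDisallow: /private/\n")
    url = "https://example.com/news/story"
    if rules.can_fetch("examplebot", url):
        HostThrottle().wait("example.com", delay_for(rules, "examplebot"))
        print("ok to fetch", url)
```

Keeping the throttle keyed by host means a slow publisher never stalls requests to the others.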
Schema.org NewsArticle
Modern news sites embed schema.org NewsArticle JSON-LD with headline, author, datePublished, and articleBody. Parse it before falling back to HTML extraction: it is stable, structured, and survives UI redesigns.
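A minimal sketch of that parsing step follows, using only the standard library. It assumes the page carries its metadata in script tags of type application/ld+json, either as a single object, a list, or an @graph container. The function name extract_news_article and the returned dict shape are illustrative choices. Fields the publisher omits come back as None.

```python
import json
from html.parser import HTMLParser

FIELDS = ("headline", "author", "datePublished", "articleBody")


class _JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _iter_nodes(payload):
    """Yield every dict in a JSON-LD payload, descending into lists and @graph."""
    if isinstance(payload, list):
        for item in payload:
            yield from _iter_nodes(item)
    elif isinstance(payload, dict):
        yield payload
        yield from _iter_nodes(payload.get("@graph", []))


def _is_news_article(node):
    types = node.get("@type", [])
    if not isinstance(types, list):
        types = [types]  # @type may be a single string
    return "NewsArticle" in types


def extract_news_article(html):
    """Return the first NewsArticle's core fields, or None if the page has none."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # publishers sometimes ship malformed JSON-LD; skip it
        for node in _iter_nodes(payload):
            if _is_news_article(node):
                return {key: node.get(key) for key in FIELDS}
    return None
```

When this returns None, or returns a result whose articleBody is None, the HTML extractor from the stack is the natural fallback.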
FAQ
Can I republish full articles?
No. Republishing full text without a license is copyright infringement. News aggregators instead use fair-use snippets with a link back to the source.