Automated traffic now accounts for roughly one in two requests on the public web, and malicious automation alone sits at about one in three. That baseline matters because it explains why modern sites scrutinize connection patterns, TLS behavior, and IP reputation far more aggressively than in earlier years. Treat scraping as a networking project first and an extraction project second, and many reliability problems shrink to math and protocol choices.
Median pages are heavier and chattier than most pipelines budget for. Typical mobile pages make around 70 to 75 requests and transfer close to 2 MB, with images near the 1 MB mark and JavaScript often approaching 400 to 500 KB. If your crawler fetches the full document and static assets, bandwidth and time-to-first-byte expectations need to reflect that request fan-out, not a single GET for HTML.
That footprint translates directly into money at scale. Pulling 10 million pages in a month at 2 MB each moves about 20 TB. With common cloud egress priced roughly 0.05 to 0.12 USD per GB, the bandwidth line item alone lands near 1,000 to 2,400 USD per month before compute and storage. Cutting just 20 percent of payload per page by skipping noncritical assets saves roughly 200 to 480 USD per month, on the order of 600 to 1,400 USD across a quarter.
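The same arithmetic as a runnable back-of-the-envelope model; the volume, page weight, egress prices, and savings share are the assumptions from the paragraph above, not measured values:

```python
# Back-of-the-envelope egress cost model. Every input below is an assumption
# carried over from the text; swap in your own measurements.
PAGES_PER_MONTH = 10_000_000
MB_PER_PAGE = 2.0                      # median full-page transfer
EGRESS_USD_PER_GB = (0.05, 0.12)       # common cloud egress price range
ASSET_SAVINGS = 0.20                   # share of payload skipped per page

gb_per_month = PAGES_PER_MONTH * MB_PER_PAGE / 1000   # decimal units, ~20 TB
for rate in EGRESS_USD_PER_GB:
    full = gb_per_month * rate
    saved = full * ASSET_SAVINGS
    print(f"@ ${rate:.2f}/GB: {gb_per_month / 1000:.0f} TB -> "
          f"${full:,.0f}/month, ${saved:,.0f}/month saved by trimming assets")
```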
Protocol support also shapes concurrency. Around 46 percent of sites use HTTP/2 and roughly 30 percent advertise HTTP/3. On HTTP/2, multiplexing dozens of streams over a single TCP connection slashes connection churn and HTTP-level head-of-line blocking compared to HTTP/1.1. On HTTP/3, QUIC recovers losses per stream, so a dropped packet on one stream no longer stalls the rest the way it does over TCP, and the combined transport-and-TLS handshake saves a round trip on new connections. Negotiating to the highest protocol both ends support is a measurable throughput win for crawlers with high parallelism.
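A minimal sketch of checking what actually gets negotiated, assuming the httpx client with its optional HTTP/2 extra installed (pip install "httpx[http2]"); HTTP/3 would need a QUIC-capable client such as aioquic and is not shown, and the URL is a placeholder:

```python
import httpx

# Offer HTTP/2 via ALPN; the server may still negotiate down to HTTP/1.1.
with httpx.Client(http2=True, timeout=10.0) as client:
    resp = client.get("https://example.com/")   # placeholder target
    # http_version reports what was actually negotiated ("HTTP/2" or "HTTP/1.1").
    print(resp.status_code, resp.http_version)
```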
Connection reuse, TLS details, and why they move the needle
TLS 1.3 reduces handshake round trips compared to TLS 1.2, which is visible in tail latencies when you establish many new connections. Session resumption and long-lived keep-alive pools matter because they shift your budget from connect time to useful bytes. In real fleets, aggressively reusing connections cuts total sockets by an order of magnitude on HTTP/2 targets, which lowers kernel overhead and makes ban patterns less spiky.
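One way to get that reuse is a single long-lived client with an explicit keep-alive budget. A sketch with httpx, where the pool limits are illustrative values rather than recommendations and the target URL is a placeholder:

```python
import httpx

# One shared client = one connection pool. Keep-alive sockets are reused
# across requests instead of paying a fresh TCP + TLS handshake each time.
limits = httpx.Limits(
    max_connections=100,            # cap on concurrent sockets (illustrative)
    max_keepalive_connections=20,   # idle sockets kept warm for reuse
    keepalive_expiry=30.0,          # seconds an idle socket stays open
)

with httpx.Client(http2=True, limits=limits, timeout=10.0) as client:
    # Against an HTTP/2 origin, these requests multiplex over very few sockets.
    for path in ("/a", "/b", "/c"):
        r = client.get(f"https://example.com{path}")   # placeholder target
        print(path, r.status_code, r.http_version)
```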
A single headless Chromium process commonly consumes 200 to 300 MB of memory, so 50 concurrent browsers can demand 10 to 15 GB before your parsers do any work.
Headless browsers are still essential for script-rendered content and anti-bot challenges, but the numbers above explain why a split architecture performs better. Use lightweight HTTP clients for the 80 percent that render server side, reserve headless capacity for pages that truly require it, and you raise success rates while flattening your compute curve.
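A sketch of that split, assuming a hypothetical render_with_browser() helper standing in for whatever headless pool you run (Playwright, Selenium, or a remote rendering service) and a content check that is purely illustrative:

```python
import httpx

def render_with_browser(url: str) -> str:
    """Placeholder for the expensive path: hand the URL to your headless
    browser pool and return the rendered HTML. Hypothetical helper."""
    raise NotImplementedError

def looks_server_rendered(html: str) -> bool:
    # Illustrative heuristic: the field we need is already in the raw HTML,
    # so no JavaScript execution is required. Adapt to your target.
    return 'class="product-price"' in html

def fetch(url: str, client: httpx.Client) -> str:
    resp = client.get(url)
    resp.raise_for_status()
    if looks_server_rendered(resp.text):
        return resp.text                 # cheap path: plain HTTP client
    return render_with_browser(url)      # expensive path: headless capacity

# Usage (placeholder URL):
# with httpx.Client(http2=True, timeout=10.0) as client:
#     html = fetch("https://example.com/item/123", client)
```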
IP reputation, quotas, and pool sizing by arithmetic
Defenders watch for rates per IP and per ASN because bot traffic is widespread. IPv6 is now used by around 40 to 45 percent of users globally, and many large sites accept it natively, which expands the address space you can rotate through. If a target allows 100 successful requests per IP per hour and you need 1,000,000 pages per day, that is about 41,667 per hour, which requires roughly 417 IPs just to meet the quota without tripping rate controls. If your observed 429 rate is 5 percent, add at least that much headroom to the pool.
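The same arithmetic as a small helper, with the per-IP quota, daily volume, and 429 headroom treated as inputs you would measure for your own targets:

```python
import math

def required_ip_pool(pages_per_day: int,
                     allowed_per_ip_per_hour: int,
                     error_headroom: float = 0.05) -> int:
    """Minimum IPs needed to stay under a per-IP hourly quota,
    padded by the observed 429/failure rate."""
    pages_per_hour = pages_per_day / 24
    base_ips = pages_per_hour / allowed_per_ip_per_hour
    return math.ceil(base_ips * (1 + error_headroom))

# Numbers from the text: 1,000,000 pages/day, 100 requests/IP/hour, 5% 429 rate.
print(required_ip_pool(1_000_000, 100, 0.05))   # -> 438 (417 plus ~5% headroom)
```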
Operational hygiene reduces avoidable errors. Normalize credentials, ports, and schemes so your fleet does not lose requests to parsing glitches or malformed proxy lines. If you maintain mixed providers and formats, run them through a single proxy formatter before deployment to keep the pool consistent and auditable.
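A minimal normalizer for mixed proxy line formats, assuming only the two common layouts host:port:user:pass and user:pass@host:port (with or without a scheme); real provider exports vary, so treat this as a sketch rather than a drop-in formatter:

```python
from urllib.parse import quote

def normalize_proxy(line: str, default_scheme: str = "http") -> str:
    """Return a canonical scheme://user:pass@host:port string.
    Accepts 'host:port', 'host:port:user:pass', and 'user:pass@host:port',
    with or without a scheme prefix. Assumption: no IPv6 literals."""
    line = line.strip()
    scheme = default_scheme
    if "://" in line:
        scheme, line = line.split("://", 1)
    if "@" in line:                              # user:pass@host:port
        creds, hostport = line.rsplit("@", 1)
        user, password = creds.split(":", 1)
        host, port = hostport.split(":", 1)
    else:
        parts = line.split(":")
        if len(parts) == 2:                      # host:port, no auth
            host, port = parts
            return f"{scheme}://{host}:{int(port)}"
        host, port, user, password = parts[:4]   # host:port:user:pass
    return f"{scheme}://{quote(user)}:{quote(password)}@{host}:{int(port)}"

print(normalize_proxy("203.0.113.7:8080:alice:s3cret"))
# -> http://alice:s3cret@203.0.113.7:8080
```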
Data correctness beats extra requests
The median page’s 70 plus requests hide a simpler path on many sites. Public HTML frequently embeds JSON endpoints that return compact payloads, and those payloads change structure far less often than CSS selectors in complex DOMs. When you can, pull the JSON directly and you trade a 2 MB page and dozens of objects for tens of kilobytes. That swap increases records per second, reduces bandwidth spend, and lowers the chance that cosmetic layout updates break your parser.
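A sketch of that trade, assuming a hypothetical /api/products endpoint discovered by watching the page's own XHR traffic; the URL, query parameters, and field names are illustrative:

```python
import httpx

# Hitting the JSON endpoint the page itself calls returns tens of kilobytes
# instead of a ~2 MB document plus dozens of asset requests.
API_URL = "https://example.com/api/products"   # hypothetical endpoint

with httpx.Client(http2=True, timeout=10.0) as client:
    resp = client.get(API_URL, params={"page": 1, "per_page": 100})
    resp.raise_for_status()
    for item in resp.json().get("items", []):   # field names are illustrative
        print(item.get("id"), item.get("price"))
```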
Failures compound quickly at scale, so size your retry policy with concrete math. At a 2 percent raw failure rate, a 1,000,000 page job yields about 20,000 failed attempts on the first try. A two-retry strategy with independent attempts drives the expected residual to roughly 0.02³ of the volume, on the order of 8 permanently failed pages per million. Log per-attempt status, protocol, handshake time, and target IP so you can separate network issues from server-side throttling and tune the right control, not just add more retries.
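A compact retry wrapper under those assumptions, with the backoff schedule, retryable status set, and logged fields as illustrative choices; handshake timing and peer IP would need transport-level hooks that are not shown here:

```python
import time
import httpx

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_retries(client: httpx.Client, url: str, max_attempts: int = 3):
    """Retry transient failures and emit one record per attempt so network
    errors can later be separated from server-side throttling."""
    for attempt in range(1, max_attempts + 1):
        start = time.monotonic()
        try:
            resp = client.get(url)
            elapsed = time.monotonic() - start
            print({"url": url, "attempt": attempt, "status": resp.status_code,
                   "proto": resp.http_version, "seconds": round(elapsed, 3)})
            if resp.status_code not in RETRYABLE:
                return resp
        except httpx.TransportError as exc:      # DNS, TLS, connect, read errors
            elapsed = time.monotonic() - start
            print({"url": url, "attempt": attempt, "error": type(exc).__name__,
                   "seconds": round(elapsed, 3)})
        time.sleep(2 ** attempt)                  # simple exponential backoff
    return None                                   # residual failure: record and move on

with httpx.Client(http2=True, timeout=10.0) as client:
    fetch_with_retries(client, "https://example.com/")   # placeholder URL
```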
