Automated traffic now accounts for roughly one in two requests on the public web, and malicious automation alone sits at about one in three. That baseline matters because it explains why modern sites scrutinize connection patterns, TLS behavior, and IP reputation far more aggressively than in earlier years. Treat scraping as a networking project first and an extraction project second, and many reliability problems shrink to math and protocol choices.
Median pages are heavier and chattier than most pipelines budget for. Typical mobile pages make around 70 to 75 requests and transfer close to 2 MB, with images near the 1 MB mark and JavaScript often approaching 400 to 500 KB. If your crawler fetches the full document and static assets, bandwidth and time-to-first-byte expectations need to reflect that request fan-out, not a single GET for HTML.
That footprint translates directly into money at scale. Pulling 10 million pages in a month at 2 MB each moves about 20 TB. With common cloud egress priced roughly 0.05 to 0.12 USD per GB, the bandwidth line item alone lands near 1,000 to 2,400 USD per month before compute and storage. Cutting just 20 percent of payload per page by skipping noncritical assets saves roughly 200 to 480 USD per month, on the order of 600 to 1,400 USD across a quarter.
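The same arithmetic as a runnable back-of-the-envelope model; the volume, page weight, egress prices, and savings share are the assumptions from the paragraph above, not measured values:

```python
# Back-of-the-envelope egress cost model. Every input below is an assumption
# carried over from the text; swap in your own measurements.
PAGES_PER_MONTH = 10_000_000
MB_PER_PAGE = 2.0                      # median full-page transfer
EGRESS_USD_PER_GB = (0.05, 0.12)       # common cloud egress price range
ASSET_SAVINGS = 0.20                   # share of payload skipped per page

gb_per_month = PAGES_PER_MONTH * MB_PER_PAGE / 1000   # decimal units, ~20 TB
for rate in EGRESS_USD_PER_GB:
    full = gb_per_month * rate
    saved = full * ASSET_SAVINGS
    print(f"@ ${rate:.2f}/GB: {gb_per_month / 1000:.0f} TB -> "
          f"${full:,.0f}/month, ${saved:,.0f}/month saved by trimming assets")
```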
Protocol support also shapes concurrency. Around 46 percent of sites use HTTP/2 and roughly 30 percent advertise HTTP/3. On HTTP/2, multiplexing dozens of streams over a single TCP connection slashes connection churn and HTTP-level head-of-line blocking compared to HTTP/1.1. On HTTP/3, QUIC recovers losses per stream, so a dropped packet on one stream no longer stalls the rest the way it does over TCP, and the combined transport-and-TLS handshake saves a round trip on new connections. Negotiating to the highest protocol both ends support is a measurable throughput win for crawlers with high parallelism.
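A minimal sketch of checking what actually gets negotiated, assuming the httpx client with its optional HTTP/2 extra installed (pip install "httpx[http2]"); HTTP/3 would need a QUIC-capable client such as aioquic and is not shown, and the URL is a placeholder:

```python
import httpx

# Offer HTTP/2 via ALPN; the server may still negotiate down to HTTP/1.1.
with httpx.Client(http2=True, timeout=10.0) as client:
    resp = client.get("https://example.com/")   # placeholder target
    # http_version reports what was actually negotiated ("HTTP/2" or "HTTP/1.1").
    print(resp.status_code, resp.http_version)
```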
Connection reuse, TLS details, and why they move the needle
TLS 1.3 reduces handshake round trips compared to TLS 1.2, which is visible in tail latencies when you establish many new connections. Session resumption and long-lived keep-alive pools matter because they shift your budget from connect time to useful bytes. In real fleets, aggressively reusing connections cuts total sockets by an order of magnitude on HTTP/2 targets, which lowers kernel overhead and makes ban patterns less spiky.
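One way to get that reuse is a single long-lived client with an explicit keep-alive budget. A sketch with httpx, where the pool limits are illustrative values rather than recommendations and the target URL is a placeholder:

```python
import httpx

# One shared client = one connection pool. Keep-alive sockets are reused
# across requests instead of paying a fresh TCP + TLS handshake each time.
limits = httpx.Limits(
    max_connections=100,            # cap on concurrent sockets (illustrative)
    max_keepalive_connections=20,   # idle sockets kept warm for reuse
    keepalive_expiry=30.0,          # seconds an idle socket stays open
)

with httpx.Client(http2=True, limits=limits, timeout=10.0) as client:
    # Against an HTTP/2 origin, these requests multiplex over very few sockets.
    for path in ("/a", "/b", "/c"):
        r = client.get(f"https://example.com{path}")   # placeholder target
        print(path, r.status_code, r.http_version)
```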
A single headless Chromium process commonly consumes 200 to 300 MB of memory, so 50 concurrent browsers can demand 10 to 15 GB before your parsers do any work.
Headless browsers are still essential for script-rendered content and anti-bot challenges, but the numbers above explain why a split architecture performs better. Use lightweight HTTP clients for the 80 percent that render server side, reserve headless capacity for pages that truly require it, and you raise success rates while flattening your compute curve.
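A sketch of that split, assuming a hypothetical render_with_browser() helper standing in for whatever headless pool you run (Playwright, Selenium, or a remote rendering service) and a content check that is purely illustrative:

```python
import httpx

def render_with_browser(url: str) -> str:
    """Placeholder for the expensive path: hand the URL to your headless
    browser pool and return the rendered HTML. Hypothetical helper."""
    raise NotImplementedError

def looks_server_rendered(html: str) -> bool:
    # Illustrative heuristic: the field we need is already in the raw HTML,
    # so no JavaScript execution is required. Adapt to your target.
    return 'class="product-price"' in html

def fetch(url: str, client: httpx.Client) -> str:
    resp = client.get(url)
    resp.raise_for_status()
    if looks_server_rendered(resp.text):
        return resp.text                 # cheap path: plain HTTP client
    return render_with_browser(url)      # expensive path: headless capacity

# Usage (placeholder URL):
# with httpx.Client(http2=True, timeout=10.0) as client:
#     html = fetch("https://example.com/item/123", client)
```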
IP reputation, quotas, and pool sizing by arithmetic
Defenders watch for rates per IP and per ASN because bot traffic is widespread. IPv6 is now used by around 40 to 45 percent of users globally, and many large sites accept it natively, which expands the address space you can rotate through. If a target allows 100 successful requests per IP per hour and you need 1,000,000 pages per day, that is about 41,667 per hour, which requires roughly 417 IPs just to meet the quota without tripping rate controls. If your observed 429 rate is 5 percent, add at least that much headroom to the pool.
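The same arithmetic as a small helper, with the per-IP quota, daily volume, and 429 headroom treated as inputs you would measure for your own targets:

```python
import math

def required_ip_pool(pages_per_day: int,
                     allowed_per_ip_per_hour: int,
                     error_headroom: float = 0.05) -> int:
    """Minimum IPs needed to stay under a per-IP hourly quota,
    padded by the observed 429/failure rate."""
    pages_per_hour = pages_per_day / 24
    base_ips = pages_per_hour / allowed_per_ip_per_hour
    return math.ceil(base_ips * (1 + error_headroom))

# Numbers from the text: 1,000,000 pages/day, 100 requests/IP/hour, 5% 429 rate.
print(required_ip_pool(1_000_000, 100, 0.05))   # -> 438 (417 plus ~5% headroom)
```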
Operational hygiene reduces avoidable errors. Normalize credentials, ports, and schemes so your fleet does not lose requests to parsing glitches or malformed proxy lines. If you maintain mixed providers and formats, run them through a single proxy formatter before deployment to keep the pool consistent and auditable.
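A minimal normalizer for mixed proxy line formats, assuming only the two common layouts host:port:user:pass and user:pass@host:port (with or without a scheme); real provider exports vary, so treat this as a sketch rather than a drop-in formatter:

```python
from urllib.parse import quote

def normalize_proxy(line: str, default_scheme: str = "http") -> str:
    """Return a canonical scheme://user:pass@host:port string.
    Accepts 'host:port', 'host:port:user:pass', and 'user:pass@host:port',
    with or without a scheme prefix. Assumption: no IPv6 literals."""
    line = line.strip()
    scheme = default_scheme
    if "://" in line:
        scheme, line = line.split("://", 1)
    if "@" in line:                              # user:pass@host:port
        creds, hostport = line.rsplit("@", 1)
        user, password = creds.split(":", 1)
        host, port = hostport.split(":", 1)
    else:
        parts = line.split(":")
        if len(parts) == 2:                      # host:port, no auth
            host, port = parts
            return f"{scheme}://{host}:{int(port)}"
        host, port, user, password = parts[:4]   # host:port:user:pass
    return f"{scheme}://{quote(user)}:{quote(password)}@{host}:{int(port)}"

print(normalize_proxy("203.0.113.7:8080:alice:s3cret"))
# -> http://alice:s3cret@203.0.113.7:8080
```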
Data correctness beats extra requests
The median page’s 70 plus requests hide a simpler path on many sites. Public HTML frequently embeds JSON endpoints that return compact payloads, and those payloads change structure far less often than CSS selectors in complex DOMs. When you can, pull the JSON directly and you trade a 2 MB page and dozens of objects for tens of kilobytes. That swap increases records per second, reduces bandwidth spend, and lowers the chance that cosmetic layout updates break your parser.
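A sketch of that trade, assuming a hypothetical /api/products endpoint discovered by watching the page's own XHR traffic; the URL, query parameters, and field names are illustrative:

```python
import httpx

# Hitting the JSON endpoint the page itself calls returns tens of kilobytes
# instead of a ~2 MB document plus dozens of asset requests.
API_URL = "https://example.com/api/products"   # hypothetical endpoint

with httpx.Client(http2=True, timeout=10.0) as client:
    resp = client.get(API_URL, params={"page": 1, "per_page": 100})
    resp.raise_for_status()
    for item in resp.json().get("items", []):   # field names are illustrative
        print(item.get("id"), item.get("price"))
```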
Failures compound quickly at scale, so size your retry policy with concrete math. At a 2 percent raw failure rate, a 1,000,000 page job yields about 20,000 failed attempts on the first try. A two-retry strategy with independent attempts drives the expected residual to roughly 0.02³ of the volume, on the order of 8 permanently failed pages per million. Log per-attempt status, protocol, handshake time, and target IP so you can separate network issues from server-side throttling and tune the right control, not just add more retries.
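A compact retry wrapper under those assumptions, with the backoff schedule, retryable status set, and logged fields as illustrative choices; handshake timing and peer IP would need transport-level hooks that are not shown here:

```python
import time
import httpx

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_retries(client: httpx.Client, url: str, max_attempts: int = 3):
    """Retry transient failures and emit one record per attempt so network
    errors can later be separated from server-side throttling."""
    for attempt in range(1, max_attempts + 1):
        start = time.monotonic()
        try:
            resp = client.get(url)
            elapsed = time.monotonic() - start
            print({"url": url, "attempt": attempt, "status": resp.status_code,
                   "proto": resp.http_version, "seconds": round(elapsed, 3)})
            if resp.status_code not in RETRYABLE:
                return resp
        except httpx.TransportError as exc:      # DNS, TLS, connect, read errors
            elapsed = time.monotonic() - start
            print({"url": url, "attempt": attempt, "error": type(exc).__name__,
                   "seconds": round(elapsed, 3)})
        time.sleep(2 ** attempt)                  # simple exponential backoff
    return None                                   # residual failure: record and move on

with httpx.Client(http2=True, timeout=10.0) as client:
    fetch_with_retries(client, "https://example.com/")   # placeholder URL
```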
