Stop Getting Blocked: A Practical Blueprint for a Resilient Scraping Pipeline

Stackademic

Most scraping failures come from one cause: you ship a script, then the target site changes one small rule. Your job shifts from “grab HTML” to “keep a data feed alive.” That work looks a lot like ops.

Bot defense also sits on your critical path. Imperva has reported that bots drive about half of all web traffic, and that bad bots alone account for about one third of all traffic. Many sites assume automation by default, so you need a plan for blocks, not hope.

Start with a contract for the data

Pick one page type and write down the fields you must extract. Define what “good” means in code, not in your head. Your pipeline needs a simple yes or no check.

For a product page, you may need title, price, stock, and SKU. For a SERP page, you may need rank, URL, and snippet text. If any field drops, treat the run as failed.
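One way to write that contract down is a small table of required fields per page type. This is a minimal sketch: `CONTRACTS`, `missing_fields`, and `run_passed` are hypothetical names, the field lists simply mirror the examples above, and treating an empty run as a failure is an assumption.

```python
# Hypothetical contract table: page type -> fields every record must carry.
CONTRACTS = {
    "product": ("title", "price", "stock", "sku"),
    "serp": ("rank", "url", "snippet"),
}


def missing_fields(page_type, record):
    """Return the contract fields that are absent or empty in `record`."""
    required = CONTRACTS[page_type]
    # 0 and False can be legitimate values (for example stock == 0), so only
    # None and empty strings count as missing.
    return [f for f in required if record.get(f) in (None, "")]


def run_passed(page_type, records):
    """Yes/no check: the run fails if any record drops any field.

    Assumption: a run that produced no records at all also counts as failed.
    """
    return bool(records) and all(
        not missing_fields(page_type, r) for r in records
    )
```

Returning the list of missing fields, not just a boolean, gives you something concrete to log when a run fails.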

Build a tight validator

Validate early, before you store or publish. Check for empty fields, odd types, and out-of-range prices. You also want a fingerprint for the page so you can spot template shifts.

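def validate_product(p):
    # Required text fields must be present and non-empty.
    if not p.get("sku") or not p.get("title"):
        return False
    price = p.get("price")
    # Reject missing or non-numeric prices (bool is an int subclass, so exclude it).
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        return False
    # Reject zero or negative prices.
    return price > 0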

Keep this strict. Loose checks hide breakage until your users complain.
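The page fingerprint mentioned in this section could be sketched as a hash of the tag and class skeleton, ignoring text. This is one possible approach, not the only one: `template_fingerprint` is a hypothetical helper built on the standard library's `html.parser`, and pages with a variable number of repeated elements (reviews, related items) would need extra normalization before the hash is stable.

```python
import hashlib
from html.parser import HTMLParser


class _ShapeCollector(HTMLParser):
    """Records tag names and class attributes, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.shape = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        # Sort class names so their order in the attribute does not matter.
        sorted_classes = ".".join(sorted(classes.split()))
        self.shape.append(tag + "." + sorted_classes)


def template_fingerprint(html):
    """Hash the page's tag/class skeleton; text changes do not affect it."""
    collector = _ShapeCollector()
    collector.feed(html)
    collector.close()
    skeleton = "|".join(collector.shape)
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()[:16]
```

Store the fingerprint with each record. A sudden change across many pages of the same type suggests a template shift, not a data change.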

Treat blocks as a normal response

A clean 200 status can still mean you lost. Many sites return a block page with a 200 code. Your client must detect that fast.

Detect block pages with cheap signals

Scan for key markers like “captcha,” “verify you are human,” or missing core nodes. Track response size too. A product page rarely drops to 5 KB by chance.

def looks_blocked(html):
    # Case-insensitive scan for common block-page markers.
    s = html.lower()
    markers = ("captcha", "verify you are human", "<title>access denied")
    return any(m in s for m in markers)

Log the page title, truncated to 200 characters, for every failed page. That one line speeds up triage.
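The size and missing-node signals, plus the title logging, might look like the sketch below. `MIN_EXPECTED_BYTES`, `page_title`, and `suspicious_response` are hypothetical names, the 5,000-byte floor is an assumed threshold taken from the 5 KB example, and the "core node" check is a plain substring match that you would swap for a real selector.

```python
import logging
import re

logger = logging.getLogger("scraper.blocks")

MIN_EXPECTED_BYTES = 5_000  # assumed floor; tune per page type
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(html, limit=200):
    """Return the <title> text, whitespace-collapsed, cut to `limit` chars."""
    match = TITLE_RE.search(html)
    if not match:
        return ""
    return " ".join(match.group(1).split())[:limit]


def suspicious_response(url, html, required_marker):
    """Cheap checks: tiny body or missing core node. Logs the title on a hit."""
    too_small = len(html.encode("utf-8")) < MIN_EXPECTED_BYTES
    missing_core = required_marker not in html
    if too_small or missing_core:
        logger.warning(
            "suspect page url=%s too_small=%s core_missing=%s title=%r",
            url, too_small, missing_core, page_title(html),
        )
        return True
    return False
```

For example, `suspicious_response(url, html, 'id="price"')` flags any product page that comes back without its price node, even with a 200 status.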

Use a retry budget, not endless loops

Retries help when a network hop fails. Retries hurt when you hit a hard block. Set a cap per URL, then switch tactics or stop.

Use jitter in backoff so your fleet does not sync. Keep the math simple so you can reason about load. A common pattern uses 2 to 4 tries with 2 to 20 seconds of wait.
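A retry budget with jittered backoff can be sketched as follows. The names `HardBlock`, `backoff_delay`, and `fetch_with_budget` are hypothetical, and `fetch` stands for whatever function you already use to get a page. The defaults (3 tries, waits between 2 and 20 seconds) follow the pattern described above.

```python
import random
import time


class HardBlock(Exception):
    """Raised by the caller's fetch function when it detects a block page."""


def backoff_delay(attempt, base=2.0, cap=20.0):
    """Jittered exponential backoff: uniform between `base` and a growing cap."""
    upper = min(cap, base * 2 ** (attempt + 1))
    return random.uniform(base, upper)


def fetch_with_budget(fetch, url, max_tries=3, sleep=time.sleep):
    """Call fetch(url) up to max_tries times; never retry a hard block."""
    last_error = None
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except HardBlock:
            # Retrying the same way only burns the IP; let the caller
            # switch tactics or stop.
            raise
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            if attempt < max_tries - 1:
                sleep(backoff_delay(attempt))
    raise RuntimeError(f"retry budget exhausted for {url}") from last_error
```

Passing `sleep` as a parameter keeps the function easy to test without real waits.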

Pick the right proxy tool for the job

Proxy choice should follow risk, not habit. Datacenter IPs run fast and cheap, but they burn fast on strict targets. Residential IPs blend in better, but they cost more and can run slower.

If the site ties trust to device-like traffic, rotate through mobile proxies. They can cut blocks on flows that punish datacenter ranges. Use them only where you see a real lift, since they add cost and more moving parts.

Rotate with intent

Rotate IPs by session, not by request, on pages that set cookies or run multi-step flows. Rotate by request for simple fetches that do not need state. Keep one user agent per session to avoid odd mixes.

Also cap concurrency per domain. Most blocks start when you look like a stress test, not when you look like one user.
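Both rules, sticky proxies per session and a concurrency cap per domain, fit in a few lines. This is a sketch under assumptions: the proxy URLs are placeholders, `MAX_PER_DOMAIN` is an arbitrary cap, and `proxy_for` and `domain_slot` are hypothetical helpers for a threaded fetcher.

```python
import itertools
import threading
from collections import defaultdict
from urllib.parse import urlsplit

# Placeholder proxy URLs; a real pool would come from your provider.
PROXIES = ["http://proxy-a.invalid:8000", "http://proxy-b.invalid:8000"]
MAX_PER_DOMAIN = 2  # assumed cap; tune per target

_proxy_cycle = itertools.cycle(PROXIES)
_session_proxy = {}
_lock = threading.Lock()
_domain_slots = defaultdict(lambda: threading.BoundedSemaphore(MAX_PER_DOMAIN))


def proxy_for(session_id=None):
    """Sticky proxy when a session id is given, next proxy per call otherwise."""
    with _lock:
        if session_id is None:
            return next(_proxy_cycle)
        if session_id not in _session_proxy:
            _session_proxy[session_id] = next(_proxy_cycle)
        return _session_proxy[session_id]


def domain_slot(url):
    """Semaphore that limits in-flight requests for the URL's host."""
    with _lock:
        return _domain_slots[urlsplit(url).hostname]
```

A worker would wrap each request in `with domain_slot(url):` and pass `proxy_for(session_id)` to its HTTP client, using `proxy_for()` with no argument for stateless fetches.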

Make your scraper look consistent, not “random”

Many teams over-randomize headers and break trust. Real browsers follow patterns. They do not swap language, timezone, and platform on each request.

Stabilize your fingerprint per session

Pick a small set of user agents that match your render stack. Tie each one to a matching Accept-Language and viewport size. Keep cookies for the session, and reset them when you rotate the session.
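A per-session profile can be modeled as a small immutable record. The sketch below is illustrative only: `BrowserProfile`, `ScrapeSession`, and `new_session` are hypothetical names, and the user agent strings are examples that you should replace with ones matching the browser you actually run.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BrowserProfile:
    user_agent: str
    accept_language: str
    viewport: tuple


# Illustrative profiles only; keep each user agent paired with a
# plausible language and viewport.
PROFILES = (
    BrowserProfile(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "en-US,en;q=0.9",
        (1920, 1080),
    ),
    BrowserProfile(
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "en-GB,en;q=0.9",
        (1440, 900),
    ),
)


@dataclass
class ScrapeSession:
    profile: BrowserProfile
    cookies: dict = field(default_factory=dict)

    def headers(self):
        """Headers stay identical for the whole life of the session."""
        return {
            "User-Agent": self.profile.user_agent,
            "Accept-Language": self.profile.accept_language,
        }


def new_session(rng=random):
    """Rotation point: pick one profile and start with an empty cookie jar."""
    return ScrapeSession(profile=rng.choice(PROFILES))
```

Randomness happens once, when the session is created. Every request inside the session then sends the same headers and the same cookies.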

If you use a headless browser, keep it lean. Block images and fonts when the page still renders. You cut load and you lower the chance of timeouts.

Cache, dedupe, and store raw inputs

Scraping costs money because you pay in IP, CPU, and risk. A cache cuts all three. Cache by URL plus a stable key like SKU, and set TTL based on how fast the page changes.
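A cache keyed on URL plus a stable identifier, with a per-entry TTL, could be sketched like this. `TTLCache` is a hypothetical in-memory stand-in; a production pipeline would more likely back it with Redis or disk, and the TTL values are yours to choose.

```python
import hashlib
import time


class TTLCache:
    """Tiny in-memory cache keyed by URL plus a stable id such as a SKU."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}

    @staticmethod
    def key(url, stable_id):
        return hashlib.sha256(f"{url}|{stable_id}".encode("utf-8")).hexdigest()

    def put(self, url, stable_id, body, ttl_seconds):
        expires_at = self._clock() + ttl_seconds
        self._items[self.key(url, stable_id)] = (expires_at, body)

    def get(self, url, stable_id):
        """Return the cached body, or None if absent or expired."""
        k = self.key(url, stable_id)
        entry = self._items.get(k)
        if entry is None:
            return None
        expires_at, body = entry
        if self._clock() >= expires_at:
            del self._items[k]
            return None
        return body
```

A fast-moving price page might get a TTL of minutes, while a spec page could sit for a day.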

Store raw HTML for failed parses in a short-lived bucket. Keep it for a day or two, then purge. That lets you debug without re-hitting the site.

Separate fetch, parse, and publish

Run fetchers as a queue worker that only returns bytes and metadata. Run parsers as a second step that turns bytes into fields. Run publishing as a third step that writes to your DB or sends events.

This split makes rollbacks easy. It also lets you re-parse old pages after you fix a selector.
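The three stages can be sketched as plain functions that pass data through a queue. Everything here is an assumption for illustration: `RAW_QUEUE` stands in for a real broker, and `fetch_bytes`, `parse`, and `sink` are whatever fetcher, parser, and storage you already have.

```python
import json
import queue
import time

RAW_QUEUE = queue.Queue()  # stand-in for a real message broker (assumption)


def fetch_stage(url, fetch_bytes, raw_queue):
    """Stage 1: store bytes plus metadata only; no parsing here."""
    raw_queue.put(
        {"url": url, "fetched_at": time.time(), "body": fetch_bytes(url)}
    )


def parse_stage(raw, parse):
    """Stage 2: turn stored bytes into fields; safe to re-run on old pages."""
    html = raw["body"].decode("utf-8", errors="replace")
    return {"url": raw["url"], "fetched_at": raw["fetched_at"], **parse(html)}


def publish_stage(record, sink):
    """Stage 3: write the record out; here the sink is just a list."""
    sink.append(json.dumps(record, sort_keys=True))
```

Because `parse_stage` takes the raw record as input, you can run a fixed parser over stored pages without fetching them again.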

Keep it compliant and safe for your team

Read the site’s terms and robots.txt rules, and involve counsel when risk feels high. Avoid personal data unless you have a clear legal basis. Respect rate limits when the site states them.

Build an internal “stop switch” too. One config flag should cut traffic to a domain in minutes. That saves you during a legal request, a ban wave, or a bug that spikes load.
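A stop switch can be as small as a config lookup before every request. In this sketch, `StopSwitch` and `load_json_config` are hypothetical names, and the `stopped_domains` key is an assumed config layout, not a standard.

```python
import json
from urllib.parse import urlsplit


def load_json_config(path):
    """Read a config file shaped like {"stopped_domains": ["example.com"]}."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


class StopSwitch:
    """Per-domain kill switch; `load_config` returns the current config dict."""

    def __init__(self, load_config):
        self._load_config = load_config

    def allows(self, url):
        # Reload on every check so a flag flip applies without a redeploy.
        stopped = set(self._load_config().get("stopped_domains", []))
        return urlsplit(url).hostname not in stopped
```

Fetch workers would call `switch.allows(url)` before each request and drop or requeue the job when it returns `False`. In practice you might build the switch with `StopSwitch(lambda: load_json_config("scraper.json"))`, where the file name is only an example.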