Stop Getting Blocked: A Practical Blueprint for a Resilient Scraping Pipeline

Most scraping failures come from one cause: you ship a script, then the target site changes one small rule. Your job shifts from “grab HTML” to “keep a data feed alive.” That work looks a lot like ops.

Bot defense also sits on your critical path. Imperva has reported that bots drive about half of all web traffic, with bad bots alone accounting for about one third of the total. Many sites assume automation by default, so you need a plan for blocks, not hope.

Start with a contract for the data

Pick one page type and write down the fields you must extract. Define what “good” means in code, not in your head. Your pipeline needs a simple yes or no check.

For a product page, you may need title, price, stock, and SKU. For a SERP page, you may need rank, URL, and snippet text. If any required field goes missing, treat the run as failed.
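One way to put that contract in code is a small table of required fields per page type. This is a minimal sketch, not a definitive implementation: the names REQUIRED_FIELDS, missing_fields, and run_passed are hypothetical, and the field lists simply mirror the examples above.

```python
# Hypothetical contract: required fields per page type (names are illustrative).
REQUIRED_FIELDS = {
    "product": ("title", "price", "stock", "sku"),
    "serp": ("rank", "url", "snippet"),
}


def missing_fields(page_type, record):
    """Return the required fields that are absent or empty in record."""
    required = REQUIRED_FIELDS[page_type]
    # Compare against None and "" only, so a legitimate 0 (e.g. stock) passes.
    return [f for f in required if record.get(f) in (None, "")]


def run_passed(page_type, records):
    """A run passes only if it has records and none is missing a field."""
    return bool(records) and all(not missing_fields(page_type, r) for r in records)
```

An empty run counts as a failure here, on the assumption that zero records is more likely a silent break than a real result.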

Build a tight validator

Validate early, before you store or publish. Check for empty fields, odd types, and out-of-range prices. You also want a fingerprint for the page so you can spot template shifts.

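def validate_product(p):
    # Required identity fields must be present and non-empty.
    if not p.get("sku"): return False
    if not p.get("title"): return False
    # Price must be an int or float (bool excluded) and strictly positive.
    price = p.get("price")
    if isinstance(price, bool) or not isinstance(price, (int, float)): return False
    if price <= 0: return False
    return True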

Keep this strict. Loose checks hide breakage until your users complain.
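The page fingerprint mentioned above could be built in several ways. The sketch below takes one assumed approach: hash the sequence of tag names and class attributes, so a change in text or price leaves the fingerprint alone while a template change alters it. The names _SkeletonParser and template_fingerprint are hypothetical. Only the Python standard library is used.

```python
import hashlib
from html.parser import HTMLParser


class _SkeletonParser(HTMLParser):
    """Collects tag names and class attributes, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        # Sort class names so their order in the attribute does not matter.
        self.parts.append(tag + "." + ".".join(sorted(classes.split())))


def template_fingerprint(html):
    """Hash the page's tag/class skeleton; text changes leave it unchanged."""
    parser = _SkeletonParser()
    parser.feed(html)
    parser.close()
    skeleton = "|".join(parser.parts)
    # A 16-character prefix is an assumed length, short enough to log.
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()[:16]
```

Store the fingerprint next to each record. When it changes across many pages at once, suspect a template shift before you suspect the data.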

Treat blocks as a normal response

A clean 200 status can still mean you lost. Many sites return a block page with a 200 code. Your client must detect that fast.

Detect block pages with cheap signals

Scan for key markers like “captcha” or “verify you are human,” and check that the core nodes you expect are still present. Track response size too. A product page rarely drops to 5 KB by chance.

def looks_blocked(html):
    # Lowercase once so every marker check is case-insensitive.
    s = html.lower()
    if "captcha" in s: return True
    if "verify you are human" in s: return True
    if "<title>access denied" in s: return True
    return False
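The size and missing-node signals can be checked just as cheaply. The following is a rough sketch under stated assumptions: too_small and missing_core_nodes are hypothetical helpers, and the ratio and min_samples defaults are placeholders rather than tuned values.

```python
from statistics import median


def too_small(html, recent_sizes, ratio=0.3, min_samples=5):
    """Flag a page whose size falls far below the median of recent good pages.

    recent_sizes holds byte sizes of pages that passed validation.
    ratio and min_samples are assumed defaults, not tuned values.
    """
    if len(recent_sizes) < min_samples:
        return False  # not enough history to judge
    return len(html.encode("utf-8")) < ratio * median(recent_sizes)


def missing_core_nodes(html, markers):
    """Return the expected markers (e.g. 'id="price"') absent from the page."""
    return [m for m in markers if m not in html]
```

Plain substring checks are crude, but they run before any parsing and cost almost nothing per page.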

Log the first 200 chars of the title for every failed page. That one line speeds up triage.
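A minimal sketch of that log line, assuming the standard logging module and a regex that is good enough for triage (it is not a full HTML parser). The logger name "scraper" and the helpers title_snippet and log_failed_page are hypothetical.

```python
import logging
import re

logger = logging.getLogger("scraper")

_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_snippet(html, limit=200):
    """Return up to `limit` chars of the <title> text, whitespace collapsed."""
    match = _TITLE_RE.search(html)
    if not match:
        return ""
    return " ".join(match.group(1).split())[:limit]


def log_failed_page(url, html):
    """Emit one triage line per failed page."""
    logger.warning(
        "failed url=%s chars=%d title=%r", url, len(html), title_snippet(html)
    )
```

An empty title in the log is itself a signal: many block pages and error stubs ship without one.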

