Build a SERP Rank Tracker That Doesn’t Burn Proxies

Stackademic

Rank tracking looks simple until you run it every day, for many keywords, in many locations. Then you hit blocks, odd HTML, and high proxy bills. You also hit a hard truth: a rank tracker is a data pipeline, not a script.

This guide shows one design that keeps costs in check. It targets devs who want repeat runs, clear logs, and clean data for SEO reports. It uses Playwright, caching, and tight retry rules.

Prereqs: Python basics, async comfort, and a place to store rows. You can start with SQLite and move to Postgres later. You also need a proxy plan that fits your risk and load.

What breaks most rank trackers

Most trackers fail for three reasons: no cache, weak block checks, and noisy retries. No cache makes you pay twice for the same query. Weak checks let bad HTML slide into your DB.

Noisy retries look like a bot storm. You send the same request five times in ten seconds. Many sites answer with HTTP 429 when you push too fast.

Rank data also shifts by location, device, and result freshness. If you do not lock those inputs, you can't trust diffs. Treat each input as part of the cache key.

A pipeline that spends less on proxies

Step 1: Normalize each task

Define one task as: keyword, locale, geo hint, device, and a search URL template. Keep device to two values: desktop or mobile. Keep locale as a short tag like en-US.

Hash the task to get a stable task_id. That id lets you dedupe and re-run with ease. It also makes your logs tight.
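A minimal sketch of that hashing step, assuming a task of keyword, locale, geo hint, and device (the helper name and the 16-character truncation are choices here, not a fixed spec):

```python
import hashlib
import json

def make_task_id(keyword: str, locale: str, geo: str, device: str) -> str:
    # Serialize with sorted keys so the same inputs always produce the same id.
    payload = json.dumps(
        {"keyword": keyword, "locale": locale, "geo": geo, "device": device},
        sort_keys=True,
    )
    # Truncated SHA-256 keeps logs tight while staying stable across runs.
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

Any change to an input changes the id, so a re-run with the same inputs dedupes cleanly.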

Step 2: Cache on task_id plus day

Most teams re-check the same keyword more than once per day by mistake. Cache the raw HTML and the parsed rank for a set window. A 24-hour TTL works for daily reports.

Store cache hits and misses as first-class stats. You want to see them in CI logs. The cache hit rate tells you whether you are wasting proxy spend.

Step 3: Pick proxy mode per target

Do not start with the most costly proxy for all runs. Start with no proxy for low-risk pages and fall back when you see blocks. You cut spend fast when you use tiers.

For hard SERPs, start with a small pool of real-user IPs from Byteful.

Use sticky sessions for one task run. Rotate between tasks, not inside one page flow. That keeps cookies and geo hints in sync.
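The tiering rule above can be sketched as a small escalation function. The tier names are placeholders for whatever plans you actually run, and the two-block threshold is an assumption you should tune:

```python
# Ordered from cheapest to most costly; names are placeholders.
TIERS = ["none", "datacenter", "residential"]

def next_proxy_tier(current: str, hard_blocks: int, threshold: int = 2) -> str:
    """Escalate one tier after `threshold` hard blocks; otherwise stay put."""
    i = TIERS.index(current)
    if hard_blocks >= threshold and i < len(TIERS) - 1:
        return TIERS[i + 1]
    return current
```

Start every target at the cheapest tier and let observed blocks, not guesses, drive escalation.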

Step 4: Detect blocks before you parse

Block pages often return HTTP 200, so status codes alone fail. Add content checks for known bad signs like captcha words, empty result shells, or sudden login walls. Keep these checks simple and fast.

Stop retries when you see a hard block twice in a row. Mark the task as blocked and move on. Your queue stays healthy.
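The "two hard blocks in a row" rule is easy to enforce if each task keeps a short block history. A minimal sketch, assuming you record a boolean per fetch attempt:

```python
def should_abandon(block_history: list[bool], limit: int = 2) -> bool:
    """True when the last `limit` fetch attempts were all hard blocks."""
    return len(block_history) >= limit and all(block_history[-limit:])
```

When this returns True, mark the task blocked and let the queue move on.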

Step 5: Store raw HTML plus a slim parse

Store the raw HTML or a gzip blob for each fetch. It lets you re-parse when the DOM shifts. It also helps you debug rank jumps.

Parse only what you need for reports. Capture top N results, title, host, and the found rank for your domain. Keep the parse stable and version it.
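A sketch of a versioned slim parse, assuming the result URLs were already extracted from the SERP DOM by your scraper (the field names here are illustrative):

```python
from urllib.parse import urlparse

PARSER_VERSION = 1  # bump whenever the extraction logic changes

def slim_parse(result_urls: list[str], own_domain: str, top_n: int = 10) -> dict:
    """Reduce extracted result URLs to the slim record the reports need."""
    hosts = [urlparse(u).netloc for u in result_urls[:top_n]]
    # Rank is 1-based; None means the domain was not in the top N.
    rank = next((i + 1 for i, h in enumerate(hosts) if own_domain in h), None)
    return {"parser_version": PARSER_VERSION, "top_hosts": hosts, "rank": rank}
```

Storing `parser_version` next to each row lets you tell a real rank jump from a parser change.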

Minimal Playwright worker (Python)

This worker shows the core loop: fetch, block check, parse, and save. It skips queue code to keep focus. You can wrap it with Redis, Celery, or a simple DB lease.

```python
import asyncio

from playwright.async_api import async_playwright

BLOCK_MARKERS = ["captcha", "unusual traffic", "verify you are"]


def looks_blocked(html: str) -> bool:
    h = html.lower()
    return any(m in h for m in BLOCK_MARKERS) or len(html) < 5000


async def fetch_html(url: str, proxy: dict | None, user_agent: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy)
        ctx = await browser.new_context(user_agent=user_agent, locale="en-US")
        page = await ctx.new_page()
        await page.goto(url, wait_until="domcontentloaded", timeout=45000)
        html = await page.content()
        await browser.close()
        return html


async def run_task(url: str, proxy: dict | None) -> dict:
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120 Safari/537.36"
    html = await fetch_html(url, proxy=proxy, user_agent=ua)
    if looks_blocked(html):
        return {"ok": False, "reason": "blocked", "html": html}
    # Keep parse small. Swap this with a real parser.
    found = "example.com" in html
    return {"ok": True, "found": found, "html": html}


if __name__ == "__main__":
    result = asyncio.run(
        run_task("https://www.google.com/search?q=python+playwright", proxy=None)
    )
    print(result["ok"])
```

Keep proxy config as data, not code. A Playwright proxy dict maps well to secret stores. You can also switch proxy per task based on past fail rate.

Compliance and site load rules you can enforce

Set a hard cap on fetch rate per host. Put it in code, not in a runbook. A simple token bucket per domain works well.
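A per-domain token bucket is small enough to live in code. A minimal sketch, where `rate` is refill tokens per second and `cap` is the burst size:

```python
import time

class TokenBucket:
    """Per-domain rate cap: up to `rate` fetches per second, burst of `cap`."""

    def __init__(self, rate: float, cap: int):
        self.rate = rate
        self.cap = cap
        self.tokens = float(cap)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill based on elapsed time, then spend one token if available.
        now = time.monotonic()
        self.tokens = min(self.cap, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Keep one bucket per host in a dict and call `allow()` before every fetch; a False means the worker waits.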

Respect robots.txt when your use case needs it. Talk to counsel when you scrape at scale or store user data. You control risk when you define scope, TTL, and access rules up front.

Log what you fetch, when you fetch it, and why. Keep request IDs and task IDs in every row. That audit trail helps when a vendor asks questions.

If you build rank tracking as a pipeline, you gain control. Cache cuts cost, block checks cut noise, and tiered proxies cut spend. You also end up with data you can trust in a report or a dashboard.