Production Web Scraping with Python — Architecture, Not Scripts
A 50-line scraping script and a system you can put on a cron job for six months are completely different artifacts. This is how I structure the second one.
The Gap Between a Script and a System
Almost every Python developer has written a scraper. requests.get(), BeautifulSoup, a for loop, a CSV writer. Twenty minutes of work, and you have data. The script runs once. Maybe it runs twice. Then a week later you point it at a slightly larger job and watch it fall over.
The interesting question isn't "how do I scrape this site?" It's "how do I build a scraper that I can run on a schedule for six months without babysitting it?" That second question is an architecture question, not a parsing question. The actual extraction code is usually the smallest part of a production scraping system. Everything around it — the queue, the cache, the storage layer, the block detection, the recovery loop, the exporters — is what determines whether the system survives contact with the real internet.
This article walks through the architecture I use for serious scraping work. It's the same shape whether the job is a one-week scrape of 10,000 URLs or an ongoing pipeline that processes a few hundred thousand pages a month. The pieces don't change; the dials do.
Why the Naive Approach Fails
A 50-line script breaks for predictable reasons. The site changes its HTML and your parser starts returning empty strings — but you don't notice until the CSV is already half-written. The site rate-limits you and your script either retries forever or crashes mid-run, leaving you with no idea what completed and what didn't. The site adds Cloudflare and your requests.get() starts returning a "Just a moment…" challenge page that parses as legitimate HTML but contains none of the data you wanted. You restart the script after a crash and it cheerfully re-fetches every URL it already had, doubling your bandwidth and your blocks.
Each of these failures is solvable in isolation. But fixing them ad hoc, inside a script, leaves you with 1,000 lines of tangled retry logic and no clean way to add the next site. The right move is to take the failure modes seriously up front and put structure around them — once — so every future scraper inherits the same operational discipline.
The Production Architecture
Every production scraper I build is structured as a pipeline of stages backed by a shared database. The stages, in order, are:
- Discover — figure out what URLs to fetch (a directory listing, a sitemap, a search result, a seed file).
- Fetch — pull each URL to local storage. Cached to disk under a key derived from the URL; never refetched once successful.
- Parse — turn the cached HTML into structured records.
- Enrich — anything that requires combining records (lookups, joins, derived fields).
- Score / transform — business logic that operates on the enriched record (confidence scores, classifications, normalization).
- Export — produce the output format the downstream consumer wants (CSV, XLSX, JSONL, a database row).
The critical property: every stage reads from the database, does its work, and writes back to the database. No stage talks directly to any other stage. That means you can re-run any single stage in isolation — re-parse with a fixed selector without re-fetching, re-score with new business rules without re-parsing, re-export to a new format without touching anything else. Crashes don't lose work because the database always reflects the last consistent state.
The schema is straightforward. A urls table tracks every URL with a status column (pending, in_progress, fetched, parsed, error) and a last_attempt_at timestamp. A records table holds the parsed output. A blocks table logs every block encountered with the response signal (HTTP code, body marker, header) that triggered it. SQLite in WAL mode handles up to a few hundred thousand rows with multiple worker processes; MySQL takes over above that or whenever the data needs to be queried by a dashboard.
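In SQLite terms, a minimal version of that schema might look like the sketch below. The exact column names are my illustration, not a spec:

```python
import sqlite3

conn = sqlite3.connect("scrape.db")
conn.execute("PRAGMA journal_mode=WAL")  # WAL mode: concurrent readers, a single writer

conn.executescript("""
CREATE TABLE IF NOT EXISTS urls (
    id              INTEGER PRIMARY KEY,
    url             TEXT UNIQUE NOT NULL,
    status          TEXT NOT NULL DEFAULT 'pending',  -- pending / in_progress / fetched / parsed / error
    cache_path      TEXT,                             -- where the raw response body lives on disk
    last_attempt_at TEXT
);
CREATE TABLE IF NOT EXISTS records (
    id      INTEGER PRIMARY KEY,
    url_id  INTEGER NOT NULL REFERENCES urls(id),
    data    TEXT NOT NULL                             -- parsed output stored as JSON
);
CREATE TABLE IF NOT EXISTS blocks (
    id      INTEGER PRIMARY KEY,
    url_id  INTEGER NOT NULL REFERENCES urls(id),
    signal  TEXT NOT NULL,                            -- HTTP code, body marker, or header that tripped detection
    seen_at TEXT NOT NULL DEFAULT (datetime('now'))
);
""")
conn.commit()
```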
The Fetcher: Where Most Scrapers Die
The fetcher is the most failure-prone part of the system, so it deserves the most care. The rule I follow is use the lightest tool that gets the data. Every step up in sophistication costs time, money, and complexity. The hierarchy:
- Plain requests with realistic headers, polite pacing (5–7 seconds between fetches with jitter), and on-disk caching. About 60% of "protected" sites fall to this layer.
- httpx with HTTP/2 when the target uses modern TLS fingerprinting. Same code shape as requests, slightly more realistic to the wire.
- Playwright headless when the site requires JavaScript execution to render the data you need.
- Playwright + Kameleo + proxy pool when the site has Cloudflare, DataDome, or similar bot protection that checks browser fingerprints.
Whatever the layer, the fetcher's interface to the rest of the pipeline is identical: take a URL, return the raw response (or raise a known exception). The caller doesn't know or care which layer handled the request. That separation means you can upgrade the fetcher without touching parsers, schedulers, or storage code.
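A sketch of what that interface can look like at the lightest layer; the header values and function shape are my illustration, and the heavier layers plug in behind the same signature:

```python
import requests

class BlockedError(Exception):
    """The response looked like a bot-protection page, not real content."""

HEADERS = {
    # A realistic desktop User-Agent; fill in whatever matches the rest of your headers.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch(url: str, timeout: float = 30.0) -> str:
    """Single entry point for the pipeline: return the raw body or raise a known exception.

    Callers never know which layer (requests, httpx, Playwright, ...) produced the response.
    Only the plain-requests layer is sketched here.
    """
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()
    return resp.text
```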
Caching is non-negotiable. Every successful fetch writes the response body, headers, and status to disk under a key derived from the URL. The next run that asks for the same URL gets the cached response back — zero network requests. This is how you iterate on parsers without burning bandwidth or proxies. It's also how a re-run after a partial failure costs nothing extra.
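A minimal sketch of the cache, assuming a JSON-file-per-URL layout and a SHA-256 key derived from the URL; the layout is an implementation choice, not a requirement:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("cache")

def cache_key(url: str) -> str:
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def cache_get(url: str) -> dict | None:
    """Return the cached response (body, headers, status) if this URL was fetched before."""
    path = CACHE_DIR / f"{cache_key(url)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return None

def cache_put(url: str, status: int, headers: dict, body: str) -> Path:
    """Persist a successful fetch so no future run ever re-requests this URL."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(url)}.json"
    path.write_text(
        json.dumps({"url": url, "status": status, "headers": headers, "body": body}),
        encoding="utf-8",
    )
    return path
```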
Block detection is built into the fetcher, not bolted on later. After every response, the fetcher inspects the HTTP status, body length, and a small set of known bot-protection markers ("Just a moment…", "Please verify you are a human", common DataDome strings). A detected block raises a BlockedError that the caller handles by rotating proxies or browser profiles and retrying exactly once. A second block on a fresh IP marks the URL as error and the pipeline moves on. No retry storms, no silent corruption.
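Roughly what the detection check and the retry-once policy can look like; the marker list here is a small illustrative sample, and the fetch and rotation hooks are passed in as callables rather than spelled out:

```python
from typing import Callable

BLOCK_MARKERS = (
    "Just a moment",                    # Cloudflare challenge page
    "Please verify you are a human",
    "geo.captcha-delivery.com",         # common DataDome challenge host
)

class BlockedError(Exception):
    pass

def looks_blocked(status: int, body: str) -> bool:
    """The check the fetcher runs after every response."""
    if status in (403, 429, 503):
        return True
    return len(body) < 2048 and any(marker in body for marker in BLOCK_MARKERS)

def fetch_once_with_rotation(
    url: str,
    do_fetch: Callable[[str], tuple[int, str]],
    rotate_identity: Callable[[], None],
) -> str:
    """Retry exactly once on a fresh identity; a second block bubbles up so the URL is marked error."""
    for attempt in range(2):
        status, body = do_fetch(url)
        if not looks_blocked(status, body):
            return body
        rotate_identity()  # swap proxy / browser profile before the single retry
    raise BlockedError(url)
```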
Parsers Should Be Boring
Parsers fail when sites change. The way to make that less painful is to write parsers that fail loudly and isolate the breakage. Two rules:
One parser per page type, in its own file. When the search-results page changes, you edit one file. When it changes again next week, you edit the same file. The parser doesn't know about the fetcher, the database, or the rest of the pipeline. It takes raw HTML in and returns a Pydantic model out.
Strict schemas catch silent failures. Pydantic with required fields means a parser that returns empty strings for a now-missing element raises a validation error instead of silently writing blank rows to the database. The error gets logged with the URL and the cached HTML path so you can debug without re-fetching.
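A sketch of the fail-loudly pattern with Pydantic (v2 syntax); the field names are invented for the example:

```python
from pydantic import BaseModel, field_validator

class ListingRecord(BaseModel):
    """Required fields mean a missing element fails validation instead of writing blank rows."""
    url: str
    title: str
    price: float

    @field_validator("title")
    @classmethod
    def title_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("empty title: the selector probably broke")
        return v
```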
Selectors are layered. CSS first because it's fastest. XPath when the structure requires walking up the DOM. Regex on text content as a last resort, when the data is in a free-form blurb. Whatever the selector, it lives in a single dictionary at the top of the parser file, so when the site changes, the diff is one line, not 30.
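A parser file following those rules might open like this; the page structure and selector strings are invented, and a real version would feed the result into the Pydantic model above rather than returning a plain dict:

```python
from lxml import html as lxml_html

# Every selector for this page type lives here. When the site changes, the diff is this dict.
SELECTORS = {
    "title": "h1.listing-title",  # CSS: the fast path (.cssselect() needs the cssselect package)
    "price": "//div[@class='pricing']//span[contains(@class, 'amount')]/text()",  # XPath fallback
}

def parse_listing(raw_html: str) -> dict:
    tree = lxml_html.fromstring(raw_html)
    title_nodes = tree.cssselect(SELECTORS["title"])
    price_nodes = tree.xpath(SELECTORS["price"])
    return {
        "title": title_nodes[0].text_content().strip() if title_nodes else "",
        "price": price_nodes[0].strip() if price_nodes else "",
    }
```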
Storage and Recovery
SQLite in WAL mode is my default for parallel scraping work. Multiple worker processes can read concurrently and serialize writes through a single writer connection. For up to a few hundred thousand rows, it's faster than any networked database, requires zero infrastructure, and ships as a single file you can hand to a client.
The schema is queue-shaped. Workers SELECT id FROM urls WHERE status = 'pending' AND id % :workers = :worker_id LIMIT 1, then UPDATE urls SET status = 'in_progress' in the same transaction. The modulo partitioning means workers never collide on the same URL. A recovery pass at startup moves any in_progress rows older than 5 minutes back to pending, so a crashed worker's URLs get picked up by the next run.
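A sketch of the claim-and-recover pattern against SQLite, using the same parameter names as the query above:

```python
import sqlite3

def claim_next_url(conn: sqlite3.Connection, worker_id: int, workers: int) -> int | None:
    """Claim one pending URL for this worker; the modulo partition prevents collisions."""
    with conn:  # one transaction: select the row and mark it in_progress
        row = conn.execute(
            "SELECT id FROM urls "
            "WHERE status = 'pending' AND id % :workers = :worker_id LIMIT 1",
            {"workers": workers, "worker_id": worker_id},
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE urls SET status = 'in_progress', last_attempt_at = datetime('now') "
            "WHERE id = ?",
            (row[0],),
        )
    return row[0]

def recover_stale(conn: sqlite3.Connection, max_age_minutes: int = 5) -> None:
    """Startup pass: anything stuck in_progress too long goes back to pending."""
    with conn:
        conn.execute(
            "UPDATE urls SET status = 'pending' "
            "WHERE status = 'in_progress' AND last_attempt_at < datetime('now', ?)",
            (f"-{max_age_minutes} minutes",),
        )
```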
For systems that need to outlive any single scrape, MySQL takes over. Same schema, same queue pattern, just with row-level locking instead of WAL. The dashboards that operators use to monitor and intervene in the system read from the same MySQL database the workers write to — there's no copy of the state living anywhere else.
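One way to write the MySQL claim, letting the database arbitrate contention instead of partitioning by modulo; this assumes MySQL 8.0's FOR UPDATE SKIP LOCKED and the mysql-connector-python driver, neither of which is prescribed above:

```python
import mysql.connector

def claim_next_url(conn) -> int | None:
    """Same queue shape as the SQLite version; row-level locking prevents double claims."""
    cur = conn.cursor()
    cur.execute("START TRANSACTION")
    cur.execute(
        "SELECT id FROM urls WHERE status = 'pending' "
        "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED"
    )
    row = cur.fetchone()
    if row is None:
        conn.rollback()
        return None
    cur.execute(
        "UPDATE urls SET status = 'in_progress', last_attempt_at = NOW() WHERE id = %s",
        (row[0],),
    )
    conn.commit()
    return row[0]
```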
Exports: The Boring Part That Matters Most
The output format is what the client actually consumes. CSV for analysts. XLSX (with formatted headers and a frozen first row) for spreadsheet folks. JSONL for engineers downstream. The exporters are tiny — usually 30 lines each — and read directly from the database. They're the last stage of the pipeline and the easiest to swap or add to.
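An exporter at roughly that size, reading straight from the database; the column names mirror the illustrative schema sketches above:

```python
import csv
import json
import sqlite3

def export_csv(db_path: str, out_path: str) -> int:
    """Read parsed records from the database and write the CSV the analyst asked for."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT urls.url, records.data FROM records JOIN urls ON urls.id = records.url_id"
    ).fetchall()
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title", "price"])
        for url, data in rows:
            record = json.loads(data)
            writer.writerow([url, record.get("title", ""), record.get("price", "")])
    return len(rows)
```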
Every export run produces a manifest: how many records, the date range covered, the schema version, a hash of the export file itself. That manifest gets emailed (or Slacked) to whoever cares, alongside a download link. The point is to make every successful run visible without anyone having to ask "did the scraper run last night?"
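A sketch of what the manifest can contain, mirroring the list above; the delivery step (email or Slack) is left out:

```python
import hashlib
from datetime import date
from pathlib import Path

def build_manifest(export_path: str, record_count: int,
                   date_from: date, date_to: date, schema_version: str) -> dict:
    """Everything a consumer needs to trust the export without asking questions."""
    file_bytes = Path(export_path).read_bytes()
    return {
        "file": Path(export_path).name,
        "records": record_count,
        "date_range": [date_from.isoformat(), date_to.isoformat()],
        "schema_version": schema_version,
        "sha256": hashlib.sha256(file_bytes).hexdigest(),
        "generated_at": date.today().isoformat(),
    }

# The manifest gets written next to the export file and attached to the run notification.
```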
This Architecture vs. the Alternatives
Off-the-shelf scraping frameworks (Scrapy, in particular) cover a lot of the same architectural ground. If you're scraping at very high concurrency against a single site that doesn't have aggressive bot protection, Scrapy's pipeline is excellent — middlewares, spiders, item pipelines, the whole framework hangs together well. Where I move away from Scrapy is when the job needs Playwright or Kameleo (bolting browser automation onto Scrapy is awkward), when the system has to run alongside a dashboard or operator UI (Scrapy assumes it owns the process), or when the work is a multi-stage pipeline rather than a high-throughput crawl. The plain-Python pipeline shape I described above is more code than Scrapy for the simple case, but it scales gracefully into all of those harder cases without re-architecture.
Fully managed scraping services (Apify, ScrapingBee, Bright Data's collector) are the right choice when you don't want to operate the system yourself, the target is well-supported, and the volume justifies the per-request pricing. They stop making sense the moment you need custom logic in the parse stage, multi-stage processing, or integration with the rest of a back-office system. At that point you're paying for someone else's infrastructure and re-implementing the interesting parts on your side anyway.
Wrap-Up
The headline difference between a script and a system isn't lines of code. It's that the system survives the things a script ignores: a site change, a block, a network blip, a partial run, a re-export against last week's data, an operator looking at a dashboard at 2am. Every architectural decision above earns its keep by absorbing one of those events without human intervention.
If you're sketching out a scraper that has to run more than once, the time to put this structure in place is at the start. Retrofitting a 1,000-line script into a pipeline is more work than building the pipeline first.
For the deeper "how do I do X reliably?" questions — proxy rotation, session recovery, JavaScript-rendered targets, anti-bot strategy — the related articles in the Python Web Scraping and Browser Automation hubs go further into each piece.
Need a Custom Automation System?
Need help building a production scraping, browser automation, or AI data extraction system? I build custom Python, Playwright, Kameleo, Undetectable, MySQL, and dashboard-based automation systems for businesses.