Building a Reliable Browser Automation Worker with Python and MySQL
A single Playwright script that processes 50 URLs is easy. A pool of six workers that processes 50,000 URLs over an unattended weekend is a different problem entirely. Here's how I build the second one.
The Hard Part Isn't the Browser
If you've written one Playwright script, writing the next one feels easy. The hard part of browser automation in production isn't the click-this-fill-that scripting — it's everything around it. How do you run six browser sessions in parallel without them stepping on each other? What happens when the script crashes 4 hours into an overnight run? How do you tell the difference between "this URL legitimately failed" and "the worker died and never got to it"? How does the operator know whether the system is making progress at 2am?
This article walks through the architecture I use for reliable browser automation work — the kind that runs unattended for hours, recovers from its own crashes, and produces a clean audit trail an operator can read in the morning. The runtime is Python. The browser is Playwright (often attached to Kameleo for fingerprint control). The coordination layer is MySQL. The pattern works whether you're scraping product pages, processing orders inside a vendor portal, or driving customer service chats in isolated browser sessions.
The Queue Is the System
Every reliable worker pool I've built has the same backbone: a queue table in the database. The schema is small.
CREATE TABLE work_queue (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    payload JSON NOT NULL,
    status ENUM('pending', 'in_progress', 'done', 'error') DEFAULT 'pending',
    worker_id INT NULL,
    attempts INT DEFAULT 0,
    last_error TEXT NULL,
    started_at TIMESTAMP NULL,
    finished_at TIMESTAMP NULL,
    INDEX idx_status_id (status, id)
);
That's the whole queue. payload holds whatever the worker needs (URL, account ID, action type). The status column moves from pending → in_progress → done (or error). The idx_status_id index is what makes "give me the next pending row" cheap, even when the table has hundreds of thousands of rows.
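Feeding the queue is one INSERT per row. A minimal sketch, assuming mysql-connector-python as the driver and a scraping workload (the payload keys here are illustrative, not prescribed by the schema):

import json

def enqueue(db, urls):
    # One pending row per URL; workers will claim them in id order.
    cur = db.cursor()
    cur.executemany(
        "INSERT INTO work_queue (payload) VALUES (%s)",
        [(json.dumps({"url": u, "action": "scrape"}),) for u in urls],
    )
    db.commit()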
Workers claim work atomically:
UPDATE work_queue
SET status = 'in_progress',
    worker_id = :worker_id,
    started_at = NOW(),
    attempts = attempts + 1
WHERE status = 'pending'
  AND id % :total_workers = :worker_index
ORDER BY id
LIMIT 1;
The modulo partitioning (id % :total_workers = :worker_index) means workers don't compete for the same rows. With six workers, worker 0 handles ids 6, 12, 18, 24… (AUTO_INCREMENT starts at 1, so there is no id 0); worker 1 handles 1, 7, 13, 19…; and so on. No row-level locking contention, no double-processing, no skipped rows. Six workers can hit the queue simultaneously and never collide.
After the UPDATE, the worker reads back its claimed row by worker_id + status='in_progress' and processes it. On success: UPDATE … SET status='done', finished_at=NOW(). On failure: UPDATE … SET status='error', last_error=:msg, finished_at=NOW(). Either way, the row stops being a candidate for future claims and the worker moves on.
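The worker-side helpers that implement this cycle are a few lines each. A sketch, assuming mysql-connector-python (the article's code calls claim_next, mark_done, mark_pending, and mark_error by name, but the bodies here are my reconstruction):

def claim_next(db, worker_index: int, total_workers: int):
    cur = db.cursor(dictionary=True)
    # Atomic claim, restricted to this worker's modulo partition.
    # %% escapes the literal modulo operator for the driver.
    cur.execute(
        """UPDATE work_queue
           SET status = 'in_progress', worker_id = %s,
               started_at = NOW(), attempts = attempts + 1
           WHERE status = 'pending' AND id %% %s = %s
           ORDER BY id LIMIT 1""",
        (worker_index, total_workers, worker_index),
    )
    db.commit()
    if cur.rowcount == 0:
        return None  # nothing left in this partition
    # Read back the single row this worker just claimed.
    cur.execute(
        "SELECT * FROM work_queue"
        " WHERE worker_id = %s AND status = 'in_progress'",
        (worker_index,),
    )
    return cur.fetchone()

def mark_done(db, row_id: int, result=None):
    # Result persistence is workload-specific; the queue row only tracks status.
    cur = db.cursor()
    cur.execute(
        "UPDATE work_queue SET status = 'done', finished_at = NOW() WHERE id = %s",
        (row_id,),
    )
    db.commit()

def mark_pending(db, row_id: int):
    # Put the row back for another claim (used after a blocked-session retry).
    cur = db.cursor()
    cur.execute(
        "UPDATE work_queue SET status = 'pending', worker_id = NULL WHERE id = %s",
        (row_id,),
    )
    db.commit()

def mark_error(db, row_id: int, msg: str):
    cur = db.cursor()
    cur.execute(
        "UPDATE work_queue SET status = 'error', last_error = %s,"
        " finished_at = NOW() WHERE id = %s",
        (msg[:2000], row_id),
    )
    db.commit()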
Recovery: The Stuck-Row Problem
Crashes happen. The browser hangs, the worker process is killed, the machine reboots. When that happens, the row that worker was holding is stuck in in_progress forever — no worker will claim it, but it never finishes either. Without a recovery mechanism, every crash leaks one row from the queue.
The fix is a startup sweep. Every worker, on startup, runs:
UPDATE work_queue
SET status = 'pending',
    worker_id = NULL
WHERE status = 'in_progress'
  AND started_at < NOW() - INTERVAL 10 MINUTE;
Any row that's been in_progress longer than the longest legitimate task (10 minutes here; tune to your workload) is moved back to pending. The next worker that comes along claims it and tries again. Crashes become non-events.
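In the worker this is a thin wrapper, called once before the browser launches; a sketch over the SQL above:

def recovery_sweep(db):
    cur = db.cursor()
    # Return stranded rows to the pool. attempts is deliberately left
    # incremented, so the max-attempts guard below still applies.
    cur.execute(
        """UPDATE work_queue
           SET status = 'pending', worker_id = NULL
           WHERE status = 'in_progress'
             AND started_at < NOW() - INTERVAL 10 MINUTE"""
    )
    db.commit()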
The attempts counter on the row is the safety valve. The worker checks it before processing — if attempts is, say, 4 or higher, the worker marks the row error with reason "max attempts exceeded" and moves on. This catches the rare case where a specific row consistently crashes the worker (a malformed payload, a target page that hangs Playwright). Without the counter, the recovery sweep would put the poison row back in the queue forever, and the workers would crash on it in turn.
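The guard itself is small; a sketch (MAX_ATTEMPTS is my name for the cap, which the article pegs at around 4):

MAX_ATTEMPTS = 4  # assumed cap; tune alongside the recovery-sweep interval

def is_poison(db, row) -> bool:
    """Park rows that keep crashing workers instead of retrying them forever."""
    if row["attempts"] >= MAX_ATTEMPTS:
        mark_error(db, row["id"], "max attempts exceeded")
        return True
    return False

The worker loop calls this right after claim_next and moves on to the next claim whenever it returns True.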
Worker Process Design
One worker is one process. Each process owns one Playwright instance and one browser context for the duration of its lifetime, claims work in a loop, and exits cleanly when the queue is drained.
from playwright.sync_api import sync_playwright

def worker_main(worker_index: int, total_workers: int):
    db = connect_db()
    recovery_sweep(db)  # reclaim rows stranded by a previous crash
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        while True:
            row = claim_next(db, worker_index, total_workers)
            if row is None:
                break  # queue drained
            try:
                result = process(page, row['payload'])
                mark_done(db, row['id'], result)
            except BlockedError as e:
                rotate_proxy_or_profile(context, e)
                mark_pending(db, row['id'])  # retry on fresh state
            except Exception as e:
                capture_failure_artifacts(page, row['id'], e)
                mark_error(db, row['id'], str(e))
        context.close()
        browser.close()
The structure is intentionally boring. There's no clever scheduling, no thread pool inside the worker, no fancy event loop. One worker, one browser, one row at a time. The concurrency comes from running this process N times in parallel — not from making any individual worker complicated.
The launcher is a simple shell or Python script that spawns N copies of worker_main with different worker_index values, waits for all of them to exit, and produces a summary. Easy to write, easy to debug, easy to scale up or down by changing one number.
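A minimal launcher sketch using the standard library; multiprocessing is my choice here, and a shell loop that spawns N python processes works just as well:

import multiprocessing

TOTAL_WORKERS = 6  # scale up or down by changing this one number

def main():
    procs = [
        multiprocessing.Process(target=worker_main, args=(i, TOTAL_WORKERS))
        for i in range(TOTAL_WORKERS)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # The queue itself is the summary: query status counts for the tally.
    print(f"all {TOTAL_WORKERS} workers exited")

if __name__ == "__main__":
    main()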
Error Capture: The 2am Operator Test
When a worker fails on a specific row, the operator needs enough information to diagnose it without running the script themselves. That means three things get captured to disk before the row is marked error:
- A screenshot of the page at the moment of failure: page.screenshot(path=f"errors/{row_id}.png", full_page=True).
- The rendered HTML: Path(f"errors/{row_id}.html").write_text(page.content()).
- The exception traceback, written into the database last_error column.
Optionally, capture a network log as well, if you've been recording page.on("response") events. For Cloudflare-class debugging, save the response headers from the last few requests too — the cf-ray header tells you exactly which Cloudflare edge served the block.
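Wired together, the capture step the worker loop calls looks something like this; a sketch of capture_failure_artifacts (the errors/ layout comes from the list above; returning the traceback for last_error is my assumption):

import traceback
from pathlib import Path

def capture_failure_artifacts(page, row_id: int, exc: Exception) -> str:
    """Persist screenshot + HTML to disk; return the traceback for last_error."""
    Path("errors").mkdir(exist_ok=True)
    try:
        page.screenshot(path=f"errors/{row_id}.png", full_page=True)
        Path(f"errors/{row_id}.html").write_text(page.content(), encoding="utf-8")
    except Exception:
        # The page itself may be gone (browser crash); don't mask the real error.
        pass
    return "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )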
The cost of these captures is a few hundred milliseconds per failure and maybe 200 KB of disk per error row. The benefit is that "what happened to row 4729?" gets answered in 30 seconds instead of an hour of re-running the script under a debugger.
MySQL Specifics
A few MySQL details that aren't optional:
Use InnoDB. MyISAM doesn't have row-level locking and you'll get table-level lock contention on the queue. InnoDB is the default in modern MySQL; just don't override it.
Set innodb_lock_wait_timeout low. 5 seconds is a reasonable value (the server default is 50). If a worker is going to wait on a queue lock for longer than that, something is wrong upstream, and you want the worker to fail fast and retry, not hang.
Connection per worker. Don't share connections across processes. Each worker process opens its own connection at startup and keeps it for its lifetime. If you're spawning hundreds of workers, look at a connection pooler (ProxySQL) — but for the 4–12 worker counts most of these projects use, plain connections are fine.
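A connect_db sketch that follows both rules, assuming mysql-connector-python; the host, user, and database names are placeholders:

import mysql.connector

def connect_db():
    db = mysql.connector.connect(
        host="127.0.0.1",
        user="queue_worker",    # placeholder credentials
        password="change-me",
        database="automation",
        autocommit=False,
    )
    cur = db.cursor()
    # Fail fast on queue locks instead of letting a worker hang.
    cur.execute("SET SESSION innodb_lock_wait_timeout = 5")
    return db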
The queue table doesn't need to be huge. Once a row has been done for more than, say, 30 days, archive it to a history table and delete it from the queue. The queue stays small and the indexes stay fast.
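A sketch of that archive step, assuming a work_queue_history table with the same columns (the table name and the 30-day window are illustrative):

INSERT INTO work_queue_history
SELECT * FROM work_queue
WHERE status = 'done'
  AND finished_at < NOW() - INTERVAL 30 DAY;

DELETE FROM work_queue
WHERE status = 'done'
  AND finished_at < NOW() - INTERVAL 30 DAY;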
Operational Visibility
The whole point of putting the queue in MySQL (vs. Redis or an in-memory list) is that you get observability for free. A small dashboard — Flask, Sinatra, anything — running off the same database can show:
- Queue depth by status (pending / in_progress / done / error); see the query sketch after this list
- Throughput over the last hour, day, week
- Worker count and per-worker throughput
- Most recent errors with screenshots one click away
- A "drain pause" toggle that prevents new
pendingwork from being claimed (for graceful shutdown)
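The first two views are single queries against the queue; a sketch, with an illustrative one-hour throughput window:

-- Queue depth by status
SELECT status, COUNT(*) AS depth
FROM work_queue
GROUP BY status;

-- Throughput over the last hour
SELECT COUNT(*) AS finished_last_hour
FROM work_queue
WHERE status = 'done'
  AND finished_at > NOW() - INTERVAL 1 HOUR;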
This dashboard is the difference between "the system is running" and "I can see exactly what the system is doing right now." The first is a hope; the second is operations.
Wrap-Up
The architecture above is intentionally simple. There's no Kubernetes, no Celery, no message broker, no exotic scheduling. A queue table in MySQL, N worker processes that claim rows by modulo, a recovery sweep on startup, and disciplined error capture. That's the whole pattern, and it scales comfortably from a single dev laptop running 4 workers to a multi-server deployment running 50.
What it gets you is the property that matters most: the system survives the things that go wrong overnight. Crashes recover. Errors get captured with enough detail to debug. Throughput is measurable. The operator looking at the dashboard at 7am knows exactly what happened while they were asleep.
For the rest of the picture — how to handle bot protection inside the worker, how to integrate Kameleo, how to build the operator dashboard on top — see the Browser Automation, Kameleo Automation, and Automation Dashboards hubs.
Need a Custom Automation System?
Need help building a production scraping, browser automation, or AI data extraction system? I build custom Python, Playwright, Kameleo, Undetectable, MySQL, and dashboard-based automation systems for businesses.