The Queue You Already Have

Samith Reddy · May 30, 2026 · 9 min read
Tags: systems, postgres, backend, engineering

Most backend engineers reach for Redis the moment they need a job queue. It is fast, it is documented everywhere, and every tutorial assumes you already have it running. The mental model has quietly hardened into a rule: queues need Redis.

But Redis is a separate process. A separate deployment. A separate failure domain. One more thing to monitor, back up, secure, and keep alive at 2am when something goes wrong in the middle of a fest demo.

What if, for most of the work you do, you do not need it?

This post is the long version of an argument I keep having with myself and with other engineers. The argument is that the database you already run is a correct, durable, observable job queue, and has been since 2016. We are going to go all the way down to the row-locking mechanics to see exactly why.


The Problem With Naive Database Queues

Before Redis became the reflex, teams did try to build queues on their relational databases. The idea is the most natural thing in the world: a jobs table, some workers polling it, mark rows done as you go.

It did not work. It is worth being precise about why, because the reason is the whole story.

Imagine two workers asking for the next available job at almost the same instant:

-- Worker A
SELECT * FROM jobs WHERE status = 'pending' ORDER BY created_at LIMIT 1;
 
-- Worker B (5 ms later)
SELECT * FROM jobs WHERE status = 'pending' ORDER BY created_at LIMIT 1;

Both queries match the same oldest row. Both workers get it back. Both start processing it. Your customer gets billed twice, the confirmation email goes out twice, or your inventory drops by two when it should have dropped by one.

Hand-drawn diagram: Worker A and Worker B both read job #42 and both charge the customer

This is not a contrived edge case. It is the default behavior of a plain SELECT. Reads in PostgreSQL do not block other reads, which is normally a feature, so nothing stops two workers from seeing the same "next" row. Under any real concurrency, duplicates are not a risk. They are a certainty.

The textbook fix is SELECT ... FOR UPDATE, which takes an exclusive row lock while you work. Worker A locks the row, Worker B blocks until A's transaction ends. Correct, finally. But watch what happens under load: every worker that wants "the next job" lines up behind the same oldest row. They are not working in parallel anymore. They are queued single-file behind one lock. Throughput collapses to roughly one worker. Your fleet of 50 workers becomes 50 cars idling at the same red light.

This exact failure mode, correct but serialized, is what pushed the industry toward Redis in the first place. Redis sidesteps locking entirely with atomic list operations (LPOP, BRPOPLPUSH): popping an element is removing it, so two clients can never pop the same element.

PostgreSQL has its own answer to this, and it is two words.


FOR UPDATE SKIP LOCKED

SKIP LOCKED shipped in PostgreSQL 9.5 (January 2016). It modifies a row lock in exactly one way: instead of waiting for a row that someone else has locked, the query pretends that row is not there and moves on to the next unlocked row that matches.

SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;

That single clause is the difference between a traffic jam and a fan-out.

Hand-drawn comparison: FOR UPDATE has workers blocked in a line; SKIP LOCKED gives each worker its own row

Here is what happens when 50 workers run that query at the same moment:

  • Worker 1 locks row A. Workers 2 through 50 see A is locked and skip it.
  • Worker 2 locks row B. Workers 3 through 50 skip B.
  • Worker 3 locks row C. Workers 4 through 50 skip C.
  • ...and so on down the list.

No worker waits on another. No two workers get the same row. Each one walks away with a unique job, atomically. The coordination happens entirely inside PostgreSQL, with no external broker, no Lua, and no application-level locking.
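
To make the fan-out concrete, here is a toy in-memory model in Go. It is only a sketch of the semantics, not of how PostgreSQL implements row locks: the row type, claimSkipLocked, and runWorkers are hypothetical names, and a sync.Mutex with TryLock stands in for the row lock.

```go
package main

import (
	"fmt"
	"sync"
)

// row is a toy stand-in for one pending job row. The mutex plays the role
// of the row lock; this is an assumption made purely for illustration.
type row struct {
	id   int
	lock sync.Mutex
}

// claimSkipLocked scans rows in order and returns the first one whose lock
// it can take without waiting, mimicking FOR UPDATE SKIP LOCKED with LIMIT 1.
// It returns nil when every row is already locked.
func claimSkipLocked(rows []*row) *row {
	for _, r := range rows {
		if r.lock.TryLock() { // never blocks: a locked row is simply skipped
			return r
		}
	}
	return nil
}

// runWorkers starts n workers that each claim one row and keep holding its
// lock, like n transactions that are all still open. It returns the ids
// that were claimed.
func runWorkers(n int, rows []*row) []int {
	var (
		mu      sync.Mutex
		claimed []int
		wg      sync.WaitGroup
	)
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if r := claimSkipLocked(rows); r != nil {
				mu.Lock()
				claimed = append(claimed, r.id)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return claimed
}

func main() {
	rows := make([]*row, 50)
	for i := range rows {
		rows[i] = &row{id: i + 1}
	}
	ids := runWorkers(50, rows)

	seen := map[int]bool{}
	for _, id := range ids {
		seen[id] = true
	}
	fmt.Println("claims:", len(ids), "distinct:", len(seen))
}
```

With 50 rows and 50 workers, every worker ends up holding a different row, and none of them ever waits. Which worker gets which row varies from run to run, which is the same ordering trade-off the real clause makes.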

Why this is actually safe: a peek at MVCC

It helps to know why the database can do this without lying to anyone. PostgreSQL uses MVCC, short for Multi-Version Concurrency Control. Every row carries hidden bookkeeping columns (xmin, xmax) that record which transaction created and which deleted or locked a given version. When you take a FOR UPDATE lock, PostgreSQL writes a lock marker tied to your transaction id onto that row version.

SKIP LOCKED simply tells the executor this: while scanning, if you reach a row whose current version is locked by a different live transaction, do not enqueue yourself behind it. Treat it as invisible for this query and keep scanning. The locks are real row locks (released on COMMIT or ROLLBACK, and dropped automatically when a crashed worker's connection closes and its transaction rolls back), so correctness never depends on the application behaving well. The one thing you give up is the ordering guarantee that a strict queue would have. Under contention, jobs can be picked slightly out of created_at order. For a work queue, that is exactly the trade you want.

The complete, safe claim

In production you don't just select. You select and mark the row as taken, in one atomic statement, so there is no gap between "I found a job" and "I claimed it":

WITH claimed AS (
  SELECT id
  FROM jobs
  WHERE status = 'pending'
  ORDER BY created_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'processing', claimed_at = NOW()
FROM claimed
WHERE jobs.id = claimed.id
RETURNING jobs.*;

Because the SELECT and the UPDATE are a single statement, they run in one transaction and there is no window where another worker can sneak in. The row is locked the instant it is selected. By the time the lock is released at commit, the row already says processing, so it no longer matches status = 'pending' for anyone else. One round-trip, fully atomic, zero races.


What You Get For Free

The quiet payoff of keeping the queue inside your database is that it inherits every guarantee PostgreSQL already provides. You are not bolting reliability on afterward. It is the substrate.

Crash safety, via the WAL. PostgreSQL writes every change to its Write-Ahead Log before acknowledging a commit, so a committed claim survives a crash of the database itself. If a worker dies mid-job, the row is simply still sitting in processing. A short cleanup query on startup (or on a timer) sweeps stale claims back to pending. Pick a timeout longer than your slowest legitimate job, or a slow worker's job will be handed out a second time. Nothing is silently dropped, because the durability guarantee is the same one that protects your actual business data.

Hand-drawn diagram: a worker crashes mid-job; the row survives in the WAL and is reclaimed on startup

-- reclaim jobs whose worker died holding them
UPDATE jobs
SET status = 'pending'
WHERE status = 'processing'
  AND claimed_at < NOW() - INTERVAL '5 minutes';

Exactly-once-ish semantics, without heroics. Here is the part Redis users quietly pay for. When your job state and your application data live in the same database, you can update both in one transaction: mark the job done and write its result atomically. Either both land or neither does. With Redis plus a separate database, every completion is a dual write to two systems that can crash in between. That is the classic dual-write problem. Getting close to exactly-once there means careful idempotency keys or hand-written Lua. In Postgres it is a COMMIT.

Hand-drawn comparison: Redis+DB can crash between two writes and diverge; one Postgres transaction commits both together

Strictly speaking, true exactly-once delivery is impossible in a distributed system. What you actually get is at-least-once delivery plus exactly-once effects through transactional idempotency. Postgres makes that second half almost free, and that is the real win.
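
Here is a toy Go model of "at-least-once delivery plus exactly-once effects". It is a sketch under stated assumptions, not database code: the store type and completeCharge are hypothetical, and the mutex stands in for a transaction that marks the job done and writes its effect before a single COMMIT.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy stand-in for a database. The mutex models a transaction:
// everything done while holding it becomes visible together.
type store struct {
	mu      sync.Mutex
	balance int
	done    map[string]bool // job ids whose effect has been committed
}

// completeCharge applies a charge for jobID at most once. Marking the job
// done and changing the balance happen in the same critical section, the
// in-memory analogue of doing both writes inside one transaction.
// It reports whether this call was the one that applied the effect.
func (s *store) completeCharge(jobID string, amount int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.done[jobID] {
		return false // redelivery: the effect is already committed
	}
	s.balance -= amount
	s.done[jobID] = true
	return true
}

func main() {
	s := &store{balance: 100, done: map[string]bool{}}

	// The same job is delivered twice, for example after a stale claim
	// was swept back to pending while the first worker was still running.
	first := s.completeCharge("job-42", 30)
	second := s.completeCharge("job-42", 30)

	fmt.Println(first, second, s.balance) // true false 70
}
```

The second delivery is harmless because the "already done" check and the effect cannot be separated. In Postgres the same shape is a status check (or a unique key on the job id) and the business write inside one transaction.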

A dead-letter queue that is just a WHERE clause. Jobs that blow past their retry budget become WHERE status = 'dead'. No new infrastructure, no separate dashboard, just a query you already know how to write.

Full history and observability. With Redis, a job ceases to exist the moment it is popped, so your history is gone. In Postgres, the row is right there:

SELECT * FROM jobs
WHERE status = 'failed'
  AND created_at > NOW() - INTERVAL '1 hour';

Every failure, retry count, payload, and timestamp is queryable with the same SQL you use for everything else. Your debugging tools are already installed.

Retries with exponential backoff, in one column. Add a "do not run before" timestamp and the scheduler falls out for free:

ALTER TABLE jobs ADD COLUMN run_after TIMESTAMPTZ DEFAULT NOW();
 
-- worker only considers jobs that are due
WHERE status = 'pending' AND run_after <= NOW()
 
-- on failure, push it into the future, further each time
UPDATE jobs
SET status = 'pending',
    attempts = attempts + 1,
    run_after = NOW() + (INTERVAL '5 seconds' * (2 ^ attempts))
WHERE id = $1;

The Honest Caveat: Bloat

I would be selling you something if I stopped at "it's free." A queue table is a high-churn table. Rows are inserted, updated several times, and deleted constantly. Under MVCC, every UPDATE writes a new row version and leaves the old one as a dead tuple for autovacuum to clean up later. A busy queue can generate dead tuples faster than autovacuum reclaims them, and both the table and its indexes bloat. A bloated pending index makes the very query you depend on slower over time.

It gets worse in one specific way that is easy to trip over. A single long-running transaction anywhere in your database holds back the global xmin horizon, which means autovacuum cannot clean up dead tuples newer than that transaction, including the ones in your queue table. One forgotten open transaction can quietly let a queue table balloon.

None of this is fatal, and it is well-trodden ground. The mitigations:

  • Tune aggressive per-table autovacuum on the queue (autovacuum_vacuum_scale_factor near 0 with a small threshold).
  • Delete or archive completed jobs promptly instead of leaving them to accumulate.
  • For very high volume, partition by time and drop whole partitions.
  • Watch for long-lived transactions in pg_stat_activity. They are the usual culprit.

This is the real boundary of the technique, and it is an operational concern, not a correctness one. Know it exists before you ship.


You Are Not the First

If this felt like a fringe trick, it is not. Some of the most-used job systems in the ecosystem are built on exactly this clause:

  • Solid Queue. As of Rails 8 it is the default Active Job backend, built on FOR UPDATE SKIP LOCKED, explicitly to let Rails apps drop Redis and Sidekiq.
  • River. A fast, robust queue for Go + Postgres (via pgx), built around SKIP LOCKED. Directly relevant to me, since the project below is Go.
  • graphile-worker and pg-boss. The two go-to Postgres-backed queues in Node, both leaning on the same primitive.
  • que. The long-standing Ruby queue that proved the Postgres-as-queue pattern years ago, though it coordinates workers with advisory locks rather than SKIP LOCKED.

When the default background-job backend of a major web framework is this exact query, "just use Postgres" stops being a hot take and starts being the mainstream, boring, correct choice.


When PostgreSQL Is Enough vs When You Actually Need Redis

The honest answer is that most applications never need Redis for their queue.

You do not need Redis if:

  • You process fewer than ~50,000 jobs per hour (roughly 14 per second).
  • Your latency tolerance is above ~50 ms (a poll interval, not a microsecond budget).
  • Your jobs touch the database anyway (almost all of them do).
  • You want transactional exactly-once effects without writing Lua.

You might need Redis if:

  • You are pushing hundreds of thousands of jobs per minute.
  • Sub-millisecond dequeue latency is a hard requirement.
  • Your jobs are purely in-memory and never touch the database at all.

For rough calibration: a single PostgreSQL instance running the SKIP LOCKED pattern comfortably sustains on the order of tens of thousands of jobs per hour on commodity hardware, while a tuned Redis and BullMQ setup reaches into the hundreds of thousands. Redis is faster. That was never in question. The question is whether that gap matters for your workload. Treat these numbers as orders of magnitude, not benchmarks. Your mileage depends entirely on payload size, hardware, and how much real work each job does.

If you are a startup processing payment webhooks, sending emails, running nightly reconciliation, or syncing data, you are not in Redis territory. You are in Postgres territory, with room to spare.

Hand-drawn flowchart: four questions; all "no" means you already have your queue, any "yes" means consider Redis


Shopify Said It First (In Production)

On May 12, 2026, Shopify published an engineering post describing how they replaced Redis with a MySQL-native SKIP LOCKED implementation for their inventory reservation system. (MySQL shipped the same SKIP LOCKED clause in 8.0, so this is not a Postgres-only idea. The clause is a vendor extension rather than part of the SQL standard, but both engines implement it.)

Their reasoning was the same one this whole post is built on. At their scale, for that workload, adding Redis meant adding a system that had to be operated, monitored, and kept consistent with the primary database, and the operational cost was not paid back by the performance gain. A team handling millions of inventory reservations a day found their relational database, used correctly, was already enough.


I Made the Same Call, Independently

On May 22, 2026, ten days after Shopify's post and before I had read it, I made the same architectural decision while building paystable.

Paystable is an open-source Go daemon that sits between Indian payment gateways and your application, enforcing reliability guarantees the gateways themselves do not provide. At its core is a stabilization engine: when a failure webhook arrives, paystable does not act on it immediately. It enqueues the event and polls the gateway's status API on jittered exponential backoff until the status holds stable across several consecutive checks.

That stabilization engine is a job queue. I needed workers claiming jobs without duplicates, retries with backoff, and crash recovery. The full set.

The reflex was to reach for Redis. Add go-redis, run a container, use a list as a queue. Twenty minutes of setup.

I did not, and the reason was a hard line in paystable's design philosophy: no Kafka, no Redis, no NATS. One fewer moving part is one fewer thing to wake up to at midnight during a college fest when traffic spikes. Paystable is meant to be a single Go binary backed by a single PostgreSQL database. Every extra dependency is a deployment cost paid by whoever runs it, and that is often a student on a small box.

So I looked at what Postgres already gave me. SELECT ... FOR UPDATE SKIP LOCKED solved concurrency. An outbox table handled delivery guarantees. The WAL handled crash recovery. The entire queue was three extra columns on a table.

The stabilization queue, in full:

import (
    "context"
    "database/sql"
    "errors"
)

// ClaimJob atomically claims one pending job for processing.
// Returns nil if no jobs are available.
func (q *Queue) ClaimJob(ctx context.Context) (*Job, error) {
    var job Job
    // Columns are listed explicitly so Scan does not depend on the table's
    // column order. The names are assumed to mirror the Job fields.
    err := q.db.QueryRowContext(ctx, `
        WITH claimed AS (
            SELECT id
            FROM stabilization_jobs
            WHERE status = 'pending'
              AND run_after <= NOW()
            ORDER BY created_at
            LIMIT 1
            FOR UPDATE SKIP LOCKED
        )
        UPDATE stabilization_jobs
        SET status = 'processing', claimed_at = NOW()
        FROM claimed
        WHERE stabilization_jobs.id = claimed.id
        RETURNING stabilization_jobs.id, txn_id, status, attempts,
                  run_after, claimed_at, payload
    `).Scan(&job.ID, &job.TxnID, &job.Status, &job.Attempts,
        &job.RunAfter, &job.ClaimedAt, &job.Payload)

    if errors.Is(err, sql.ErrNoRows) {
        return nil, nil // queue is empty, or every due job is locked
    }
    if err != nil {
        return nil, err
    }
    return &job, nil
}

// Reschedule moves a failed job back to pending with exponential backoff.
func (q *Queue) Reschedule(ctx context.Context, id string, attempts int) error {
    _, err := q.db.ExecContext(ctx, `
        UPDATE stabilization_jobs
        SET status    = 'pending',
            attempts  = $2,
            run_after = NOW() + ($3 * INTERVAL '1 second')
        WHERE id = $1
    `, id, attempts, backoff(attempts))
    return err
}

// backoff returns the retry delay in seconds.
func backoff(attempts int) int {
    // 5s -> 10s -> 20s -> 40s -> 80s -> 160s (capped)
    // Clamp before shifting: a negative shift count panics, and a very
    // large one overflows to zero, which would slip past the cap.
    if attempts < 0 {
        attempts = 0
    }
    if attempts > 5 {
        return 160
    }
    return 5 * (1 << attempts)
}

No Redis client. No connection pool to a second system. No separate deployment. Just PostgreSQL doing what it has been able to do since 2016.

When Shopify's post surfaced in my feed afterward, the convergence was not surprising. The constraints push you to the same answer. When you value reliability over raw throughput, when you want exactly-once effects without Lua, when you want the queue to be a first-class citizen of your transactional data model, SKIP LOCKED is the answer.


The Decision Framework

Before you add Redis next time, ask four questions.

1. What is my actual job volume? Under ~50,000 an hour, Postgres covers you with headroom to spare.

2. Do my jobs touch the database anyway? If yes, exactly-once effects with Redis mean wrestling the dual-write problem. With Postgres it is one transaction.

3. How much do I want to operate? Redis is its own deployment, monitoring, persistence config, and failure mode. For a solo project or small team, that cost is real and recurring.

4. What happens when the queue system crashes? With Redis, even with appendonly yes, the default appendfsync everysec policy can lose up to about a second of acknowledged writes. With Postgres, the WAL makes a job durable the moment the transaction commits.

If Postgres wins those four, you already have your queue.


Conclusion

The industry moved to Redis for job queues because naive database queues had a real concurrency bug. FOR UPDATE SKIP LOCKED fixed that bug in 2016, and the ecosystem has been slow, collectively, to update the reflex.

Shopify noticed in production. Rails 8 shipped Solid Queue on this exact foundation and made it the default. River brought it to Go. I bumped into it building paystable under a self-imposed "no extra infrastructure" rule. None of us coordinated. The constraints just point the same direction.

The lesson is not "never use Redis." Redis is a genuinely excellent tool for the problems it is built for, and at real scale you will reach for it deliberately and be glad it exists. The lesson is narrower and more useful. Reaching for it reflexively, before asking whether the database you already run solves the problem, is a habit worth breaking.

Your job queue might already be running. You just have not written the SKIP LOCKED query yet.


A blog from someone working to change the reliability guarantees of Indian payment gateways.

Written by Samith Reddy

Creating with code. Small details matter.
