race condition when checking uniqueness between two tables

Győző Papp <gyozo.papp@xxxxxxxxxxxxx> · Thu, 28 Feb 2019 22:45:31 +0100

Hi all,

First and foremost I admit what follows is quite awkward and a bit
long. Sorry for both.

However I still would really like to understand what I am missing in
regards of transaction isolation levels and MVCC if CTE involved.

Given a sort of scheduler application with two tables:

CREATE TABLE available_jobs (id UUID PRIMARY KEY, url TEXT NOT NULL);
CREATE TABLE assigned_jobs (id UUID PRIMARY KEY, url TEXT NOT NULL);

New jobs are always inserted into `available_jobs` but only if its id
(primary key) does not exist in neither of the two tables, so for the
time being we used the INSERT below:

INSERT INTO available_jobs(id, url)
SELECT :id, :url
WHERE NOT EXISTS (
      SELECT id FROM assigned_jobs WHERE id = :id
    )
ON CONFLICT DO NOTHING

A periodic task moves rows form `available_jobs` to `assigned_jobs`
the jobs that has been assigned:

WITH jobs_to_assign AS (
  DELETE FROM available_jobs
    USING v_assignable_jobs
    WHERE available_jobs.id = v_assignable_jobs.id
    RETURNING available_jobs.*
)
INSERT
  INTO assigned_jobs(id, url)
  SELECT id, url
    FROM jobs_to_assign
    RETURNING *;

And a similar one that “unassigns” failed assigned jobs, that is,
pushes back rows from `assigned_jobs` to `available_jobs`. And it
sometimes fails with:

ERROR: duplicate key value violates unique constraint "jobs_pkey"
DETAIL: Key (id)=(1fd0626c-f953-3278-82a1-8e4320d28914) already
exists.

>From the application logs we suspect that "duplicates" sneak in when
assignment takes a bit longer and overlaps with new job insertion with
a primary key already in use.

My first question is: how could it happen, which phenomenon are we
risking/experiencing in these cases?

Moreover, to my biggest surprise when we added at first blush a
superfluous condition to the INSERT statement (partly for fun partly
because of being clueless):

INSERT INTO available_jobs(id, url)
SELECT :id, :url
WHERE NOT EXISTS (
      SELECT id FROM assigned_jobs WHERE id = :id
    ) AND NOT EXISTS (
      SELECT id FROM available_jobs WHERE id = :id
    )
ON CONFLICT DO NOTHING

we did never spot such duplicates  that always otherwise with the same
load and data distribution with a fairly long test period. My next
question would be what it adds to the full picture that prevents
creating a duplicate that ON CONFLICT DO NOTHING could not?

Many thanks for any hints, insights or suggestions. Sorry again for
being so long.

Notes:
* We are using 10.6 in AWS RDS.
* Each transaction is in READ_COMMITTED.
* I simplified the original table layout a lot for simplicity and
unfortunately had no time to reproduce the original phenomenon.
* `v_assignable_jobs` view does the job selection based on other
tables including `assigned_jobs` as well.
* Long time back we had one single jobs table only but job selection
had poor performance at high load we decided to split the table into
two. This is the explanation of the primary key name `jobs_pkey`.

--
Gyozo Papp

-- 
The
 information contained in this email may be confidential. It has been 

sent for the sole use of the intended recipient(s). If the
reader of this 
email is not an intended recipient, you are hereby 
notified that any 
unauthorized review, use, disclosure, dissemination, 
distribution, or 
copying of this message is strictly prohibited. If you 
have received this 
email in error, please notify
the sender immediately and destroy all copies 
of the message.