Hi everyone,
Is this a well-known bug? I just hit the same issue. Below are the steps to reproduce.
Create the data table and partition it by state:
CREATE TABLE data (
    id bigserial not null,
    state smallint not null DEFAULT 1,
    updated_at timestamp without time zone default now()
) partition by list(state);
create table data_pending partition of data for values in (1);
create table data_processing partition of data for values in (2);
create table data_done partition of data for values in (3);
Generate test data:
INSERT INTO data
SELECT generate_series(1,1000) AS id, 1 AS state, NOW();
Move data from the pending state to processing in batches:
UPDATE data
SET state = 2, updated_at = NOW()
WHERE id IN (
    SELECT id
    FROM data
    WHERE state = 1
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
You can now process them in the application and move them to the done state when finished.
However, this doesn't work: FOR UPDATE SKIP LOCKED fails on partitioned tables with the following error:
ERROR: tuple to be locked was already moved to another partition due to concurrent update
Here is the full script to test this:
cat > test-skip-locked-with-partitions << 'SCRIPT'
PSQL_CMD="psql -q -U postgres"
eval $PSQL_CMD > /dev/null << EOF
CREATE TABLE IF NOT EXISTS data (
    id bigserial not null,
    state smallint not null DEFAULT 1,
    updated_at timestamp without time zone default now()
) partition by list(state);
CREATE TABLE IF NOT EXISTS data_pending partition of data for values in (1);
CREATE TABLE IF NOT EXISTS data_processing partition of data for values in (2);
CREATE TABLE IF NOT EXISTS data_done partition of data for values in (3);
INSERT INTO data
SELECT generate_series(1,1000) AS id, 1 AS state, NOW();
EOF
function run {
eval $PSQL_CMD > /dev/null << EOF
UPDATE data
SET state = 2, updated_at = NOW()
WHERE id IN (
    SELECT id
    FROM data
    WHERE state = 1
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
EOF
}
for i in {1..100}; do (run &); done
SCRIPT
bash test-skip-locked-with-partitions
This results in:
root@e92e3fa7fecf:/# bash test-skip-locked-with-partitions
ERROR: tuple to be locked was already moved to another partition due to concurrent update
ERROR: tuple to be locked was already moved to another partition due to concurrent update
ERROR: tuple to be locked was already moved to another partition due to concurrent update
ERROR: tuple to be locked was already moved to another partition due to concurrent update
ERROR: tuple to be locked was already moved to another partition due to concurrent update
ERROR: tuple to be locked was already moved to another partition due to concurrent update
ERROR: tuple to be locked was already moved to another partition due to concurrent update
ERROR: tuple to be locked was already moved to another partition due to concurrent update
ERROR: tuple to be locked was already moved to another partition due to concurrent update
This works well if the table is not partitioned. However, the query with WHERE state = 1 has a hard time when there are millions of records. Adding an index doesn't help, as the query planner tends to ignore it and do a sequential scan anyway.
The workaround for now is to create a separate queue table, which is essentially custom partitioning.
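For illustration, that workaround could look roughly like this (the data_queue name and the exact queries are just a sketch): keep a small, non-partitioned table holding only pending ids, claim from it with SKIP LOCKED, and delete claimed rows so the hot set stays small.
CREATE TABLE data_queue (
    id bigint primary key
);
-- enqueue new work alongside the insert into data
INSERT INTO data_queue
SELECT generate_series(1,1000);
-- claim a batch: data_queue is not partitioned, so SKIP LOCKED behaves
-- as expected, and each deleted id is handed to exactly one worker
WITH claimed AS (
    DELETE FROM data_queue
    WHERE id IN (
        SELECT id
        FROM data_queue
        LIMIT 10
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id
)
UPDATE data
SET state = 2, updated_at = NOW()
FROM claimed
WHERE data.id = claimed.id
RETURNING data.id;
Since each id leaves the queue via exactly one worker's DELETE, no two workers ever update the same data row, so the cross-partition move no longer races with a lock attempt.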
INSERT SAD CAT PICTURE
https://gist.github.com/arvenil/b46e927c943fa7495780ea2ae5492e78
Best, Kamil Dziedzic
Hi Gunther & List,
I think I have an extremely similar issue, and things point in the same direction of a potential problem with SKIP LOCKED on partitioned tables.
Background: I had a queue table on v9.6 with fairly high volume (>50M items, growing by 1M+ daily). Processing the queue with FOR UPDATE SKIP LOCKED was reliable, but the traffic volume on v9.6 and the fact that v12 is current led to migrating to v12 and using a partitioned table.
The queue has distinct categories of items, so the table is list-partitioned on category. Processing an item in one category updates it to the next logical category (i.e. it migrates partitions once it is processed).
Within each category there can be tens of millions of rows, so each list partition is hash-partitioned as well. I don't think this is the issue, but I mention it for completeness.
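For illustration, the shape is roughly this (the names and category values here are made up, not the real schema):
CREATE TABLE queue (
    id bigint not null,
    category smallint not null,
    payload text
) PARTITION BY LIST (category);
-- each category partition is itself hash-partitioned on id
CREATE TABLE queue_cat1 PARTITION OF queue
    FOR VALUES IN (1) PARTITION BY HASH (id);
CREATE TABLE queue_cat1_h0 PARTITION OF queue_cat1
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE queue_cat1_h1 PARTITION OF queue_cat1
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);
CREATE TABLE queue_cat2 PARTITION OF queue FOR VALUES IN (2);
-- processing an item moves it to the next category, i.e. across partitions
UPDATE queue SET category = 2 WHERE id = 42;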
Now, when processing the queue, there are regular transaction aborts with "tuple to be locked was already moved to another partition due to concurrent update".
From everything I can trace, it really does look like this is caused by rows which should have been locked/skipped because they are being processed by a different thread.
I tried switching 'for update' to 'for key share', and that created a cascade of deadlock aborts, so it was worse for my situation.
For now, I roll back and repeat the SELECT ... FOR UPDATE SKIP LOCKED until it succeeds, which it eventually does.
However, it really feels like these rows should just have been skipped by PostgreSQL, without the rollback/retry until success.
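If it helps anyone, the retry can be expressed server-side roughly like this; I'm assuming the error surfaces as serialization_failure (SQLSTATE 40001), which is worth double-checking on your version, and the table names follow the sketch above:
DO $$
DECLARE
    claimed bigint[];
BEGIN
    LOOP
        BEGIN
            -- try to claim a batch of items in category 1
            SELECT array_agg(id) INTO claimed
            FROM (
                SELECT id
                FROM queue
                WHERE category = 1
                LIMIT 10
                FOR UPDATE SKIP LOCKED
            ) t;
            EXIT;  -- claim succeeded, row locks are held
        EXCEPTION WHEN serialization_failure THEN
            NULL;  -- "tuple ... moved to another partition": retry
        END;
    END LOOP;
    -- process the claimed ids here
END
$$;
The EXCEPTION block rolls back to a subtransaction savepoint, so the retry happens without aborting the whole transaction.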
So, am I missing something or doing it wrong? Or is there a potential issue that needs to be raised?
Thanks
Jim
On 30-Jun.-2020 12:10, Gunther Schadow wrote:
Hi all,
A long time ago, with your help, I devised a task queuing system that uses SELECT ... FOR UPDATE SKIP LOCKED for many parallel workers to find tasks in the queue. It uses a partitioned table where the hot part of the queue is short, so the query for a job is quick, and the SKIP LOCKED mode makes sure that one job is only assigned to one worker. This works pretty well for me, except that when we run many workers we see a lot of these failures:
"tuple to be locked was already moved to another partition due to concurrent update"
This would not exactly look like a bug, because the message says "to be locked", so at least it's not allowing two workers to lock the same tuple. But it seems that the skip-locked mode should not turn this into an error; it should treat the tuple as if it were already locked. Why would it want to lock the tuple (representing the job) if another worker has already finished its UPDATE marking the job as "done" (which is what moves the tuple to the "completed" partition)?
Either the SELECT for jobs to do returned a wrong tuple, one that was already updated, or there is some lapse in the locking.
Either way, it seems a waste of time to throw all these errors when the tuple should not even have been selected for update and locking.
I wonder if anybody knows anything about this issue? Of course you'll want to see the DDL and SQL queries, etc., but you can't really try it out unless you do some massively parallel magic, so I figured I'd just ask.
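In essence, though, the shape is something like this (heavily simplified, names changed):
CREATE TABLE jobs (
    id bigserial not null,
    done boolean not null default false
) PARTITION BY LIST (done);
CREATE TABLE jobs_hot PARTITION OF jobs FOR VALUES IN (false);
CREATE TABLE jobs_completed PARTITION OF jobs FOR VALUES IN (true);
-- each worker grabs one job; SKIP LOCKED keeps workers from colliding
SELECT id
FROM jobs
WHERE NOT done
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- marking the job done moves the tuple to jobs_completed; a concurrent
-- worker trying to lock it at that instant gets the error above instead
-- of simply skipping the row
UPDATE jobs SET done = true WHERE id = 1;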
regards,
-Gunther