Thank you for your prompt reply!
I've mentioned that I generated ballast data to make the cost optimizer switch to an index scan, so that the predicate locks become page-level.
But my question is about finer-grained (smaller than a page) predicate locks for indexes. Even with page-level locks I can still get serialization failures if I add more queries (or emulate them with sleeps) to the transaction that runs the UPDATE on Users.
Below I describe the problem again for pgsql-general:
I have a concurrent test suite that runs 14 test cases. Each test case operates on a disjoint set of records, doesn't retry transactions and runs under the 'serializable' isolation level. The test data is small and likely fits within a single heap page.
When I finished the test suite I was surprised that PostgreSQL 14.5 returned a serialization failure on every run. I was even more surprised when I ran the suite against the current CockroachDB and didn't get serialization failures. I was eventually able to reproduce RETRY_SERIALIZABLE errors a couple of times on CockroachDB, but it required running the test suite in a loop for more than half an hour.
I started to investigate the behaviour with PostgreSQL on simplified, stripped-down code and found a serialization failure between two concurrent `update_user` operations.
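Each test case runs its statements in a single transaction of roughly this shape (a simplified sketch; the real tests go through the application's database driver):

BEGIN ISOLATION LEVEL SERIALIZABLE;
-- the test case's queries, e.g. the UPDATE shown below
COMMIT;
-- a serialization failure (SQLSTATE 40001) at any point fails the test case; it is not retried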
The test defines the following `Users` table:
CREATE TABLE Users (
id UUID,
title VARCHAR(255),
first_name VARCHAR(40),
last_name VARCHAR(80) NOT NULL,
email VARCHAR(255) NOT NULL,
lower_email VARCHAR(255) GENERATED ALWAYS AS (lower(email)) STORED,
marketing_optin BOOLEAN,
mobile_phone VARCHAR(50),
phone VARCHAR(50),
phone_ext VARCHAR(40),
is_contact BOOLEAN DEFAULT false NOT NULL,
unlinked_link_ids UUID[],
CONSTRAINT unique_user_email UNIQUE(lower_email),
PRIMARY KEY (id)
);
Each concurrent `update_user` operation runs the following UPDATE query to change the user's email to a unique value:
UPDATE Users
SET
    title = CASE WHEN false = true THEN 'foo' ELSE title END,
    first_name = CASE WHEN false = true THEN 'foo' ELSE first_name END,
    last_name = CASE WHEN false = true THEN 'foo' ELSE last_name END,
    email = CASE WHEN true = true THEN 'email2' ELSE email END,
    marketing_optin = CASE WHEN false = true THEN true ELSE marketing_optin END,
    mobile_phone = CASE WHEN false = true THEN 'foo' ELSE mobile_phone END,
    phone = CASE WHEN false = true THEN 'foo' ELSE phone END,
    phone_ext = CASE WHEN false = true THEN 'foo' ELSE phone_ext END
WHERE id = '018629fd-7b28-743c-8647-b6321c166d46';
I use the following helper view to monitor locks:
CREATE VIEW locks_v AS
SELECT pid,
virtualtransaction,
locktype,
CASE locktype
WHEN 'relation' THEN relation::regclass::text
WHEN 'virtualxid' THEN virtualxid::text
WHEN 'transactionid' THEN transactionid::text
WHEN 'tuple' THEN relation::regclass::text||':'||page::text||':'||tuple::text
WHEN 'page' THEN relation::regclass::text||':'||page::text
END AS lockid,
mode,
granted
FROM pg_locks;
When the test Users table has only a few records, the query uses a sequential scan and the serialization failure is reproducible without inserting sleeps before the `update_user` transaction commits.
This is caused by relation-level predicate locks on the Users table:
select * from locks_v;
pid | virtualtransaction | locktype | lockid | mode | granted
------+--------------------+---------------+-------------------+------------------+---------
3676 | 5/2444 | relation | unique_user_email | RowExclusiveLock | t
3676 | 5/2444 | relation | users_pkey | RowExclusiveLock | t
3676 | 5/2444 | relation | users | RowExclusiveLock | t
3676 | 5/2444 | virtualxid | 5/2444 | ExclusiveLock | t
3737 | 4/13470 | relation | pg_locks | AccessShareLock | t
3737 | 4/13470 | relation | locks_v | AccessShareLock | t
3737 | 4/13470 | virtualxid | 4/13470 | ExclusiveLock | t
3669 | 3/17334 | relation | unique_user_email | RowExclusiveLock | t
3669 | 3/17334 | relation | users_pkey | RowExclusiveLock | t
3669 | 3/17334 | relation | users | RowExclusiveLock | t
3669 | 3/17334 | virtualxid | 3/17334 | ExclusiveLock | t
3676 | 5/2444 | transactionid | 6571 | ExclusiveLock | t
3669 | 3/17334 | transactionid | 6570 | ExclusiveLock | t
3676 | 5/2444 | relation | users | SIReadLock | t
3669 | 3/17334 | relation | users | SIReadLock | t
(15 rows)
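For completeness, this small-table case can be reproduced by hand with two psql sessions along these lines (the second UUID is an illustrative id for the other test user, and the exact statement that reports the failure can vary):

-- session A
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE Users SET email = 'email-a' WHERE id = '018629fd-7b28-743c-8647-b6321c166d46';

-- session B, before session A commits
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE Users SET email = 'email-b' WHERE id = '018629fd-7b28-743c-8647-b6321c166d47'; -- illustrative id

-- session A
COMMIT;

-- session B
COMMIT; -- one of the two transactions fails with SQLSTATE 40001:
        -- "could not serialize access due to read/write dependencies among transactions"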
If I add ballast data to the Users table (1000 records) the cost optimizer switches to an index scan and it's hard to reproduce the issue for two concurrent `update_user` operations without sleeps. After adding long sleeps after the UPDATE query and before the commit I could see page-level predicate locks for the primary key index users_pkey:
select * from locks_v;
pid | virtualtransaction | locktype | lockid | mode | granted
-----+--------------------+---------------+-------------------+------------------+---------
371 | 6/523 | relation | unique_user_email | RowExclusiveLock | t
371 | 6/523 | relation | users_pkey | RowExclusiveLock | t
371 | 6/523 | relation | users | RowExclusiveLock | t
371 | 6/523 | virtualxid | 6/523 | ExclusiveLock | t
381 | 14/215 | relation | unique_user_email | RowExclusiveLock | t
381 | 14/215 | relation | users_pkey | RowExclusiveLock | t
381 | 14/215 | relation | users | RowExclusiveLock | t
381 | 14/215 | virtualxid | 14/215 | ExclusiveLock | t
350 | 4/885 | relation | pg_locks | AccessShareLock | t
350 | 4/885 | relation | locks_v | AccessShareLock | t
350 | 4/885 | virtualxid | 4/885 | ExclusiveLock | t
371 | 6/523 | transactionid | 1439 | ExclusiveLock | t
381 | 14/215 | transactionid | 1431 | ExclusiveLock | t
381 | 14/215 | page | users_pkey:5 | SIReadLock | t
371 | 6/523 | page | users_pkey:5 | SIReadLock | t
(15 rows)
With sleeps the serialization failure is reproduced on each run.
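For reference, the ballast data can be generated with something like this (a minimal sketch; 1000 rows as described above, the generated values themselves are arbitrary):

-- fill Users with ballast rows so that the planner prefers an index scan
INSERT INTO Users (id, last_name, email)
SELECT gen_random_uuid(),
       'ballast-' || i,
       'ballast-' || i || '@example.com'
FROM generate_series(1, 1000) AS i;
ANALYZE Users; -- refresh planner statistics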
I started to read more about the SSI implementation in PostgreSQL. The paper https://arxiv.org/pdf/1208.4179.pdf mentions:
Currently, locks on B+-tree indexes are acquired at page granularity; we intend to refine this to next-key locking [16] in a future release.
[16] C. Mohan. ARIES/KVL: A key-value locking method for concurrency control of multiaction transactions operating on B-tree indexes. In VLDB, pages 392–405, 1990.
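One quick way to see at which granularity the SIRead predicate locks end up in a given run is to aggregate pg_locks (essentially the same data as locks_v above):

SELECT locktype, count(*) AS locks
FROM pg_locks
WHERE mode = 'SIReadLock'
GROUP BY locktype
ORDER BY locktype;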
My question follows:
Does the current PostgreSQL release support B+-tree index predicate locks more granular than page-level locks?
With kindest regards, Rinat Shigapov
On Tue, 7 Feb 2023 at 16:29, Laurenz Albe <laurenz.albe@xxxxxxxxxxx> wrote:
On Tue, 2023-02-07 at 16:23 +0600, Rinat Shigapov wrote:
> I have a concurrent testsuite that runs 14 test cases. Each test case operates
> on a disjoint set of records, doesn't retry transactions and is run under
> 'serializable' isolation level. The test data is small and likely fits within
> a single tuple page.
>
> When I finished the test suite I was surprised that PostgreSQL 14.5 returns
> serialization failure on every test suite run.
This is no question for the hackers list; redirecting to general.
That behavior sounds perfectly normal to me: if everything is in a single
page, PostgreSQL probably won't use an index scan. With a sequential scan,
the predicate lock will be on the whole table. So you should expect
serialization failures. This is well documented.
Perhaps you should use a more realistic test case with a reasonable
amount of data.
Yours,
Laurenz Albe