On Fri, 2022-05-27 at 14:00 -0500, Don Seiler wrote:
> * PostgreSQL 12.9 - PGDG Ubuntu 18.04 image
> * Streaming physical replication
> * hot_standby_feedback = on
>
> We use a read replica to offload a lot of (what should be) quick queries. This morning we had an incident
> where these queries were all blocking on AccessShareLock waits, written to the log as:
>
> 2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp LOG: process 8185 still waiting for AccessShareLock on relation 16834 of database 16401 after 1000.228 ms at character 204
> 2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp DETAIL: Process holding the lock: 10822. Wait queue: 32373, 8185, 13782, [...]
>
> This went on for 30 seconds (the value of max_standby_streaming_delay) until PG killed the blocking process:
>
> 2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp FATAL: terminating connection due to conflict with recovery
> 2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp DETAIL: User was holding a relation lock for too long.
>
> I'm trying to find the root cause of why this started. We did see an UPDATE [...]

An UPDATE cannot be the problem.

> 15 seconds later we then see an aggressive autovacuum on this table:
>
> 2022-05-27 15:23:52.507 UTC [30513] LOG: automatic aggressive vacuum of table "foo_all.public.industry": index scans: 1
>         pages: 252 removed, 323 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 8252 removed, 8252 remain, 0 are dead but not yet removable, oldest xmin: 1670999292
>         buffer usage: 12219 hits, 137 misses, 54 dirtied
>         avg read rate: 2.372 MB/s, avg write rate: 0.935 MB/s
>         system usage: CPU: user: 0.14 s, system: 0.00 s, elapsed: 0.45 s
>
> and less than a second after that is when we see the first AccessShareLock message on the replica.
>
> I've been reading tales of autovacuum taking an AccessExclusiveLock when truncating empty pages at the end of a table.
> I'm imagining that updating every row of a table and then rolling back would leave all of those new row versions dead
> at the end, qualify the trailing pages for truncation, and lead to the scenario I saw this morning.
>
> I'm still not entirely satisfied, since the table in question was so small (only 8252 rows) that I wouldn't imagine it
> would hold things up as long as it did. Also, the blocking session on the replica was an application session,
> not any background/recovery process.

I think you are on the right trail.

VACUUM will truncate trailing empty pages if it can get a (very short) ACCESS EXCLUSIVE lock on the table.
Both that lock and the truncation are replicated, and they can create a replication conflict just like the one
you describe: even if the lock is held only very briefly on the primary, replaying it will conflict with any
query on that table on the standby.

You can disable autovacuum truncation on the table with

  ALTER TABLE public.industry SET (vacuum_truncate = off);

if you know that you can do without autovacuum truncation for that table.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com
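
A minimal sketch for checking this, assuming the database and table names from the log above (foo_all and
public.industry); both views queried here are standard PostgreSQL catalogs/statistics, not anything specific
to this thread:

  -- Run on the standby: confl_lock counts sessions cancelled because they
  -- held a lock that conflicted with recovery (the "relation lock for too
  -- long" case above); confl_snapshot counts cancellations caused by
  -- removed row versions instead.
  SELECT datname, confl_lock, confl_snapshot
  FROM pg_stat_database_conflicts
  WHERE datname = 'foo_all';

  -- Run on the primary after the ALTER TABLE above, to verify that the
  -- vacuum_truncate storage option is recorded on the table:
  SELECT relname, reloptions
  FROM pg_class
  WHERE oid = 'public.industry'::regclass;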