On Fri, 2022-05-27 at 14:00 -0500, Don Seiler wrote:
> * PostgreSQL 12.9 - PGDG Ubuntu 18.04 image
> * Streaming physical replication
> * hot_standby_feedback = on
>
> We use a read replica to offload a lot of (what should be) quick queries. This morning we had an incident
> where these queries were all blocking on AccessShareLock waits, written to the log as:
>
> 2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp LOG: process 8185 still waiting for AccessShareLock on relation 16834 of database 16401 after 1000.228 ms at character 204
> 2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp DETAIL: Process holding the lock: 10822. Wait queue: 32373, 8185, 13782, [...]
>
> This went on for 30 seconds (the value of max_standby_streaming_delay) until PG killed the blocking process:
>
> 2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp FATAL: terminating connection due to conflict with recovery
> 2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp DETAIL: User was holding a relation lock for too long.
>
> I'm trying to find the root cause of why this started. We did see an UPDATE [...]

An UPDATE cannot be the problem.

> 15 seconds later we then see an aggressive autovacuum on this table:
>
> 2022-05-27 15:23:52.507 UTC [30513] LOG: automatic aggressive vacuum of table "foo_all.public.industry": index scans: 1
>         pages: 252 removed, 323 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 8252 removed, 8252 remain, 0 are dead but not yet removable, oldest xmin: 1670999292
>         buffer usage: 12219 hits, 137 misses, 54 dirtied
>         avg read rate: 2.372 MB/s, avg write rate: 0.935 MB/s
>         system usage: CPU: user: 0.14 s, system: 0.00 s, elapsed: 0.45 s
>
> and less than a second after that is when we see the first AccessShareLock message on the replica.
>
> I've been reading tales of autovacuum taking an AccessExclusiveLock when truncating empty pages at the end of a table.
> I'm imagining that updating every row of a table and then rolling back would leave all of those new row versions dead
> at the end, qualify the trailing pages for truncation, and lead to the scenario I saw this morning.
>
> I'm still not entirely satisfied, since the table in question was so small (only 8252 rows) that I wouldn't imagine it
> would hold things up as long as it did. Also, the blocking session on the replica was an application session,
> not any background/recovery process.

I think you are on the right trail.

VACUUM will truncate trailing empty pages if it can get a (very short) ACCESS EXCLUSIVE lock on the table.
Both that lock and the truncation are replicated, and they can create a replication conflict just like the one
you describe: even if the lock is held only very briefly on the primary, replaying it will conflict with any
query on that table on the standby.

You can disable autovacuum truncation on the table with

  ALTER TABLE public.industry SET (vacuum_truncate = off);

if you know that you can do without autovacuum truncation for that table.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com
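
A minimal sketch for checking this, assuming the database and table names from the log above (foo_all and
public.industry); both views queried here are standard PostgreSQL catalogs/statistics, not anything specific
to this thread:

  -- Run on the standby: confl_lock counts sessions cancelled because they
  -- held a lock that conflicted with recovery (the "relation lock for too
  -- long" case above); confl_snapshot counts cancellations caused by
  -- removed row versions instead.
  SELECT datname, confl_lock, confl_snapshot
  FROM pg_stat_database_conflicts
  WHERE datname = 'foo_all';

  -- Run on the primary after the ALTER TABLE above, to verify that the
  -- vacuum_truncate storage option is recorded on the table:
  SELECT relname, reloptions
  FROM pg_class
  WHERE oid = 'public.industry'::regclass;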