G'day all, We have a postgres 11.9 instance serving 7-8k queries/sec. In general it's humming along with no issues (p50 patency of ~1ms; p95 ~30ms). Lately though, we've seen occasional locking that causes all commits to the main database to be briefly blocked and the APIs that it backs to stall. This is what I know so far: * It's happened at varying times of day over the last 3 weeks. Around 10 significant cases (20-30 second stalls) and more shorter ones. * Each time queries are blocked behind an AccessExclusiveLock on the database (class 1262) that is itself blocked. * There's no change in DB throughput at the time. * No schema changes or any other queries are being made that would involve an AccessExclusiveLock. * The blocked queries are all commit statements; our metrics show dozens of traces each time that look perfectly normal except for a 20 second commit somewhere within them. * We make heavy use of listen/notify via que with ~80 notify events per second across 16 listen channels. (https://github.com/que-rb/que) Based on all that and the logs below, the locks in question appear to be held while inserting NOTIFY messages on commit: https://www.postgresql.org/message-id/3598.1363354686%40sss.pgh.pa.us If that's the case, the real question is why a small number of those commits are taking so much longer than expected. Any advice on how to proceed from here much appreciated. Cheers Ben Hoskings --- The observed outage here started at ~00:35:22 and ended when the blocked locks were acquired. 2021-01-27 00:35:15.264 UTC [1554652]: [21-1] db=main,user=runtime LOG: process 1554652 still waiting for ShareLock on transaction 3399854732 after 1000.091 ms 2021-01-27 00:35:15.264 UTC [1554652]: [22-1] db=main,user=runtime DETAIL: Process holding the lock: 1542461. Wait queue: 1554652, 1542154, 1542496. 2021-01-27 00:35:15.264 UTC [1554652]: [23-1] db=main,user=runtime CONTEXT: while rechecking updated tuple (#,#) in relation "devices" 2021-01-27 00:35:15.264 UTC [1554652]: [24-1] db=main,user=runtime STATEMENT: UPDATE "devices" SET "last_successful_request_at" = '2021-01-27 00:35:14.203471', "updated_at" = '2021-01-27 00:35:14.204966' WHERE "devices"."id" = # 2021-01-27 00:35:15.265 UTC [1542154]: [82-1] db=main,user=runtime LOG: process 1542154 still waiting for ShareLock on transaction 3399854732 after 1000.098 ms 2021-01-27 00:35:15.265 UTC [1542154]: [83-1] db=main,user=runtime DETAIL: Process holding the lock: 1542461. Wait queue: 1554652, 1542154, 1542496. 2021-01-27 00:35:15.265 UTC [1542154]: [84-1] db=main,user=runtime CONTEXT: while rechecking updated tuple (#,#) in relation "devices" 2021-01-27 00:35:15.265 UTC [1542154]: [85-1] db=main,user=runtime STATEMENT: UPDATE "devices" SET "last_successful_request_at" = '2021-01-27 00:35:14.226635', "updated_at" = '2021-01-27 00:35:14.227932' WHERE "devices"."id" = # 2021-01-27 00:35:15.267 UTC [1542521]: [93-1] db=main,user=runtime LOG: process 1542521 still waiting for ShareLock on transaction 3399854728 after 1000.067 ms 2021-01-27 00:35:15.267 UTC [1542521]: [94-1] db=main,user=runtime DETAIL: Process holding the lock: 1547170. Wait queue: 1542514, 1542521. 2021-01-27 00:35:15.267 UTC [1542521]: [95-1] db=main,user=runtime CONTEXT: while updating tuple (#,#) in relation "devices" 2021-01-27 00:35:15.267 UTC [1542521]: [96-1] db=main,user=runtime STATEMENT: UPDATE "devices" SET "last_successful_request_at" = '2021-01-27 00:35:14.257563', "updated_at" = '2021-01-27 00:35:14.258934' WHERE "devices"."id" = # 2021-01-27 00:35:15.278 UTC [1479762]: [60-1] db=main,user=runtime LOG: process 1479762 still waiting for AccessExclusiveLock on object 0 of class 1262 of database 0 after 1000.129 ms 2021-01-27 00:35:15.278 UTC [1479762]: [61-1] db=main,user=runtime DETAIL: Process holding the lock: 1518905. Wait queue: 1517976, 1518853, 1445675, 1518975, 1477989, 1518976, 1517975, 3018947, 1468183, 1477990, 1532581, 1474376, 1558501, 1477991, 1473039, 1468182, 1558509, 1477988, 1558516, 3019502, 1518852, 1473037, 1473038, 1547149, 1518909, 1479762, 1479765. 2021-01-27 00:35:15.290 UTC [1479765]: [51-1] db=main,user=runtime LOG: process 1479765 still waiting for AccessExclusiveLock on object 0 of class 1262 of database 0 after 1000.110 ms 2021-01-27 00:35:15.290 UTC [1479765]: [52-1] db=main,user=runtime DETAIL: Process holding the lock: 1518905. Wait queue: 1517976, 1518853, 1445675, 1518975, 1477989, 1518976, 1517975, 3018947, 1468183, 1477990, 1532581, 1474376, 1558501, 1477991, 1473039, 1468182, 1558509, 1477988, 1558516, 3019502, 1518852, 1473037, 1473038, 1547149, 1518909, 1479762, 1479765. 2021-01-27 00:35:15.290 UTC [1479765]: [53-1] db=main,user=runtime STATEMENT: COMMIT 2021-01-27 00:35:15.293 UTC [1542496]: [130-1] db=main,user=runtime LOG: process 1542496 still waiting for ShareLock on transaction 3399854732 after 1000.123 ms 2021-01-27 00:35:15.293 UTC [1542496]: [131-1] db=main,user=runtime DETAIL: Process holding the lock: 1542461. Wait queue: 1554652, 1542154, 1542496. 2021-01-27 00:35:15.293 UTC [1542496]: [132-1] db=main,user=runtime CONTEXT: while updating tuple (#,#) in relation "devices" 2021-01-27 00:35:15.293 UTC [1542496]: [133-1] db=main,user=runtime STATEMENT: UPDATE "devices" SET "last_successful_request_at" = '2021-01-27 00:35:14.283515', "updated_at" = '2021-01-27 00:35:14.284677' WHERE "devices"."id" = # 2021-01-27 00:35:24.507 UTC [32]: [10-1] db=,user= LOG: using stale statistics instead of current ones because stats collector is not responding 2021-01-27 00:35:34.698 UTC [1565612]: [1-1] db=,user= LOG: using stale statistics instead of current ones because stats collector is not responding 2021-01-27 00:35:34.702 UTC [32]: [11-1] db=,user= LOG: using stale statistics instead of current ones because stats collector is not responding 2021-01-27 00:35:35.263 UTC [3485525]: [17-1] db=main,user=datadog LOG: using stale statistics instead of current ones because stats collector is not responding 2021-01-27 00:35:35.263 UTC [3485525]: [18-1] db=main,user=datadog STATEMENT: SELECT psd.datname, numbackends, xact_commit, xact_rollback, blks_read, blks_hit, tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted, 2^31 - age(datfrozenxid) as wraparound, deadlocks, temp_bytes, temp_files, pg_database_size(psd.datname) as pg_database_size FROM pg_stat_database psd JOIN pg_database pd ON psd.datname = pd.datname WHERE psd.datname not ilike 'template%%%%' AND psd.datname not ilike 'rdsadmin' AND psd.datname not ilike 'azure_maintenance' AND psd.datname not ilike 'postgres' 2021-01-27 00:35:39.293 UTC [1517976]: [40-1] db=main,user=runtime LOG: process 1517976 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 25247.913 ms 2021-01-27 00:35:39.348 UTC [1542521]: [97-1] db=main,user=runtime LOG: process 1542521 acquired ShareLock on transaction 3399854728 after 25081.166 ms 2021-01-27 00:35:39.348 UTC [1542521]: [98-1] db=main,user=runtime CONTEXT: while updating tuple (#,#) in relation "devices" 2021-01-27 00:35:39.348 UTC [1542521]: [99-1] db=main,user=runtime STATEMENT: UPDATE "devices" SET "last_successful_request_at" = '2021-01-27 00:35:14.257563', "updated_at" = '2021-01-27 00:35:14.258934' WHERE "devices"."id" = # 2021-01-27 00:35:39.348 UTC [1542514]: [88-1] db=main,user=runtime LOG: process 1542514 acquired ShareLock on transaction 3399854728 after 25086.306 ms 2021-01-27 00:35:39.348 UTC [1542514]: [89-1] db=main,user=runtime CONTEXT: while rechecking updated tuple (#,#) in relation "devices" 2021-01-27 00:35:39.348 UTC [1542514]: [90-1] db=main,user=runtime STATEMENT: UPDATE "devices" SET "last_successful_request_at" = '2021-01-27 00:35:14.225360', "updated_at" = '2021-01-27 00:35:14.226612' WHERE "devices"."id" = # 2021-01-27 00:35:39.348 UTC [1542496]: [134-1] db=main,user=runtime LOG: process 1542496 acquired ShareLock on transaction 3399854732 after 25055.414 ms 2021-01-27 00:35:39.348 UTC [1542496]: [135-1] db=main,user=runtime CONTEXT: while updating tuple (#,#) in relation "devices" 2021-01-27 00:35:39.348 UTC [1542496]: [136-1] db=main,user=runtime STATEMENT: UPDATE "devices" SET "last_successful_request_at" = '2021-01-27 00:35:14.283515', "updated_at" = '2021-01-27 00:35:14.284677' WHERE "devices"."id" = # 2021-01-27 00:35:39.348 UTC [1542154]: [86-1] db=main,user=runtime LOG: process 1542154 acquired ShareLock on transaction 3399854732 after 25083.744 ms 2021-01-27 00:35:39.348 UTC [1542154]: [87-1] db=main,user=runtime CONTEXT: while rechecking updated tuple (#,#) in relation "devices" 2021-01-27 00:35:39.348 UTC [1542154]: [88-1] db=main,user=runtime STATEMENT: UPDATE "devices" SET "last_successful_request_at" = '2021-01-27 00:35:14.226635', "updated_at" = '2021-01-27 00:35:14.227932' WHERE "devices"."id" = # 2021-01-27 00:35:39.348 UTC [1554652]: [25-1] db=main,user=runtime LOG: process 1554652 acquired ShareLock on transaction 3399854732 after 25084.652 ms 2021-01-27 00:35:39.348 UTC [1554652]: [26-1] db=main,user=runtime CONTEXT: while rechecking updated tuple (#,#) in relation "devices" 2021-01-27 00:35:39.348 UTC [1554652]: [27-1] db=main,user=runtime STATEMENT: UPDATE "devices" SET "last_successful_request_at" = '2021-01-27 00:35:14.203471', "updated_at" = '2021-01-27 00:35:14.204966' WHERE "devices"."id" = # 2021-01-27 00:35:39.445 UTC [1518853]: [62-1] db=main,user=runtime LOG: process 1518853 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 25399.688 ms 2021-01-27 00:35:39.597 UTC [1445675]: [65-1] db=main,user=runtime LOG: process 1445675 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 25550.833 ms 2021-01-27 00:35:39.751 UTC [1518975]: [57-1] db=main,user=runtime LOG: process 1518975 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 25704.969 ms 2021-01-27 00:35:39.822 UTC [1477989]: [47-1] db=main,user=runtime LOG: process 1477989 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 25775.611 ms 2021-01-27 00:35:39.912 UTC [1518976]: [42-1] db=main,user=runtime LOG: process 1518976 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 25864.312 ms 2021-01-27 00:35:40.040 UTC [1517975]: [55-1] db=main,user=runtime LOG: process 1517975 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 25991.996 ms 2021-01-27 00:35:40.155 UTC [3018947]: [1147-1] db=main,user=runtime LOG: process 3018947 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 26104.830 ms 2021-01-27 00:35:40.207 UTC [1468183]: [40-1] db=main,user=runtime LOG: process 1468183 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 26155.200 ms 2021-01-27 00:35:40.238 UTC [1477990]: [53-1] db=main,user=runtime LOG: process 1477990 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 26181.880 ms 2021-01-27 00:35:40.256 UTC [1532581]: [27-1] db=main,user=runtime LOG: process 1532581 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 26187.341 ms 2021-01-27 00:35:40.263 UTC [1474376]: [40-1] db=main,user=runtime LOG: process 1474376 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 26191.817 ms 2021-01-27 00:35:40.272 UTC [1558501]: [3-1] db=main,user=runtime LOG: process 1558501 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 26192.658 ms 2021-01-27 00:35:40.279 UTC [1477991]: [61-1] db=main,user=runtime LOG: process 1477991 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 26100.590 ms 2021-01-27 00:35:40.285 UTC [1473039]: [36-1] db=main,user=runtime LOG: process 1473039 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 26106.395 ms 2021-01-27 00:35:40.308 UTC [1468182]: [45-1] db=main,user=runtime LOG: process 1468182 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 26128.550 ms 2021-01-27 00:35:40.329 UTC [1558509]: [3-1] db=main,user=runtime LOG: process 1558509 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 26143.959 ms