On Tue, Nov 30, 2021 at 5:16 AM James Sewell <james.sewell@xxxxxxxxxxxx> wrote: >> >> How did you verify that, maybe some process started IO and stuck >> there? Can we check pg_stat_activity that is there some process that >> shows in the wait event as SLRURead/SLRUWrite and not coming out of >> that state? Looking again into the back trace[1], it appeared that the backend is getting stuck while getting the subtransaction's parent information, and that can only happen if the snapshot is getting marked as overflowed. So it seems that some of the scripts are using a lot of sub-transaction (> 64) within a transaction and in such case, we can not cache the sub-transaction information and the snapshot will be marked overflowed. After that, for checking the visibility of any tuple we need to consult the SLRU to get TopMost transaction information and if the subtransaction SLRU cache is full then it will have to do the I/O. So in this case it appeared that a lot of parallel scans/new connections are trying to read the pg_subtrans and due to limited cache size that is not even able to load the data in the cache and getting frequently blocked. So I assume that when you say the startup is getting stuck forever, how long did you wait, maybe it is moving but getting stuck again for the next tuple, so to confirm that attach the process with gdb put breakpoint in XidInMVCCSnapshot() and check whether it is hitting that breakpoint again or not. Also, check whether you are seeing "SubtransControlLock" in pg_stat_activity. There is one nice blog[2] explaining this issue so you can refer to that as well. [1] 21 #2 0x00007f18fb9dec6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0 20 #3 0x000000000071c482 in PGSemaphoreLock () 19 #4 0x000000000079078c in LWLockAcquire () 18 #5 0x00000000004fa9ae in SimpleLruWaitIO.isra.1 () 17 #6 0x00000000004fb2a4 in SimpleLruReadPage () 16 #7 0x00000000004fbc07 in SubTransGetParent () 15 #8 0x00000000004fbc57 in SubTransGetTopmostTransaction () 14 #9 0x00000000008f65a7 in XidInMVCCSnapshot () 13 #10 0x00000000008f662e in HeapTupleSatisfiesMVCC () 12 #11 0x00000000004c436e in heapgetpage () [2]https://www.cybertec-postgresql.com/en/subtransactions-and-performance-in-postgresql/ -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com