Hi,

We just had an incident on one of our non-production databases where 14 unrelated queries were all hung in wait event IPC / ParallelFinish. We had systematically called pg_cancel_backend()/pg_terminate_backend() on all other backends except these (and the autovacuum process mentioned below) to make sure there wasn't some other resource they were deadlocked on.
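For reference, the cancels/terminates were done with queries along these lines (the excluded PIDs below are placeholders, not the real ones; this is just a sketch of what we ran):

    -- Cancel everything except our own session and the backends we wanted to keep
    -- (the 14 stuck parallel leaders and the autovacuum worker).
    SELECT pid, pg_cancel_backend(pid)
      FROM pg_stat_activity
     WHERE pid <> pg_backend_pid()
       AND pid NOT IN (11111, 22222);   -- placeholder PIDs

    -- For the backends that didn't go away after the cancel:
    SELECT pid, pg_terminate_backend(pid)
      FROM pg_stat_activity
     WHERE pid <> pg_backend_pid()
       AND pid NOT IN (11111, 22222);   -- placeholder PIDs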
We attached gdb to a number of the backends and found their backtraces to look like this:

#0  0x00007f9ea3e77903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x000000000077cb5e in WaitEventSetWait ()
#2  0x000000000077d149 in WaitLatch ()
#3  0x00000000004f1d75 in WaitForParallelWorkersToFinish ()
#4  0x00000000006294e7 in ExecParallelFinish ()
#5  0x000000000063a57d in ExecShutdownGather ()
…
#6  0x0000000000629978 in ExecShutdownNode ()       <-- Then zero or more of
#7  0x0000000000676c01 in planstate_tree_walker ()  <-- this pair
…
#10 0x0000000000629925 in ExecShutdownNode ()
#11 0x000000000062494e in standard_ExecutorRun ()
#12 0x00007f9e99d73f5d in pgss_ExecutorRun () from /remote/install/sw/external/20180117-4-64/lib/pg_stat_statements.so
#13 0x00000000007a5c24 in PortalRunSelect ()
#14 0x00000000007a7316 in PortalRun ()
#15 0x00000000007a2b49 in exec_simple_query ()
#16 0x00000000007a4157 in PostgresMain ()
#17 0x000000000047926f in ServerLoop ()
#18 0x00000000007200cc in PostmasterMain ()
#19 0x000000000047af97 in main ()

We also sent one of the backends a SIGABRT, so we have a core dump to play with. The only other backend running at the time was an autovacuum process, which may also have been hung: it didn't have a wait event in pg_stat_activity, but I didn't get a chance to strace it or attach gdb, as the database restarted itself after we sent the SIGABRT.
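For completeness, this is roughly the query we were using to watch the wait events (nothing exotic):

    -- The 14 stuck backends all showed wait_event_type = 'IPC' and
    -- wait_event = 'ParallelFinish'; the autovacuum row had wait_event NULL.
    SELECT pid, backend_type, state, wait_event_type, wait_event,
           now() - query_start AS runtime, query
      FROM pg_stat_activity
     ORDER BY pid;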
The host is running Postgres v10.1 on RHEL 7.4. Any ideas what could have caused this, or what we could do to investigate it further?

Thanks,
Steve.