On Wed, Mar 6, 2019 at 1:39 AM Boris Sagadin <boris@xxxxxxxxxxxxx> wrote:
> PgSQL 10.7, Ubuntu 16.04 LTS
>
> Symptoms:
>
> - server accepts new queries until connections exhausted (all queries are SELECT)
> - queries are active, never end, but no disk IO
> - queries can't be killed with kill -TERM or pg_terminate_backend()
> - system load is minimal (vmstat shows 100% idle)
> - perf top shows nothing
> - statement_timeout is ignored
> - no locks with SELECT relation::regclass, * FROM pg_locks WHERE NOT granted;
> - server exits only on kill -9
> - strace on SELECT process indefinitely shows:
>
> futex(0x7f00fe94c938, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff^Cstrace: Process 121319 detached
> <detached ...>
>
> GDB backtrace:
>
> (gdb) bt
> #0  0x00007f05256f1827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f00fe94ba38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
> #1  do_futex_wait (sem=sem@entry=0x7f00fe94ba38, abstime=0x0) at sem_waitcommon.c:111
> #2  0x00007f05256f18d4 in __new_sem_wait_slow (sem=0x7f00fe94ba38, abstime=0x0) at sem_waitcommon.c:181
> #3  0x00007f05256f197a in __new_sem_wait (sem=sem@entry=0x7f00fe94ba38) at sem_wait.c:29
> #4  0x000055c9b95eb792 in PGSemaphoreLock (sema=0x7f00fe94ba38) at pg_sema.c:316
> #5  0x000055c9b965eaec in LWLockAcquire (lock=0x7f00fe96f880, mode=mode@entry=LW_EXCLUSIVE) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/storage/lmgr/lwlock.c:1233
> #6  0x000055c9b96497f7 in dsm_create (size=size@entry=105544, flags=flags@entry=1) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/storage/ipc/dsm.c:493
> #7  0x000055c9b94139ff in InitializeParallelDSM (pcxt=pcxt@entry=0x55c9bb8d9d58) at /build/postgresql-10-BKASGd/postgresql-10-10.7/build/../src/backend/access/transam/parallel.c:268

Hello Boris,

This looks like a known symptom of a pair of bugs we recently tracked
down and fixed:

1. "dsa_area could not attach to segment": dsm.c, fixed in commit 6c0fb941.
2. "cannot unpin a segment that is not pinned": dsm.c, fixed in commit 0b55aaac.

Do you see either of those messages earlier in your logs?

The bug was caused by a failure to allow for a corner case in which a
new shared memory segment gets the same ID as a recently or
concurrently destroyed one, and it is made more likely to occur by
not-very-random random numbers.  If one of these errors occurs while
cleaning up after a parallel query, a backend can self-deadlock trying
to clean up the same thing again in the error-handling path, and other
backends will then block on that lock when they next try to run a
parallel query.

The fix will be included in the next set of releases.  In the meantime
you could consider turning off parallel query (set
max_parallel_workers_per_gather = 0).  In practice I think you could
also avoid this problem by loading a library that calls something like
srandom(getpid()) in _PG_init(), so that it runs in every parallel
worker, making ID collisions extremely unlikely.  That's not really a
serious recommendation, though, since it requires writing C code.

--
Thomas Munro
https://enterprisedb.com