On Wed, Nov 22, 2017 at 07:43:50PM -0500, Tom Lane wrote: > Justin Pryzby <pryzby@xxxxxxxxxxxxx> writes: > > For starters, I found that PID 27427 has: > > > (gdb) p proc->lwWaiting > > $1 = 0 '\000' > > (gdb) p proc->lwWaitMode > > $2 = 1 '\001' > > To confirm, this is LWLockAcquire's "proc", equal to MyProc? > If so, and if LWLockAcquire is blocked at PGSemaphoreLock, > that sure seems like a smoking gun. Right: (gdb) bt #0 0x0000003f6ee0d930 in ?? () #1 0x00007f19f3400d38 in ?? () #2 0x00000000006a4440 in PGSemaphoreLock (sema=0x7f19f3400d38) at pg_sema.c:310 #3 0x0000000000714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) at lwlock.c:1233 #4 0x00000000006f75e5 in BufferAlloc (smgr=0x1f3c368, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "") at bufmgr.c:1012 #5 ReadBuffer_common (smgr=0x1f3c368, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "") at bufmgr.c:745 [...] (gdb) up #1 0x00007f19f3400d38 in ?? () (gdb) up #2 0x00000000006a4440 in PGSemaphoreLock (sema=0x7f19f3400d38) at pg_sema.c:310 310 errStatus = sem_wait(PG_SEM_REF(sema)); (gdb) up #3 0x0000000000714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) at lwlock.c:1233 1233 PGSemaphoreLock(proc->sem); (gdb) p proc->lwWaiting $1 = 0 '\000' (gdb) p proc->lwWaitMode $2 = 1 '\001' (gdb) p proc $3 = (PGPROC *) 0x7f1a77dba500 (gdb) p MyProc $4 = (PGPROC *) 0x7f1a77dba500 (gdb) p MyProc==proc $5 = 1 I suppose one needs to show that this backend was stuck and not "just happened to start 1ms before gdb connected", which I actually have: postgres 27427 17665 0 16:57 ? 00:00:00 postgres: indigo indigo 207.241.169.248(37226) startup ... postgres 30298 17665 0 17:01 ? 00:00:00 postgres: main main 207.241.169.100(43088) startup ..so that process was in "startup" 3-4 minutes after being forked. > My hypothesis about a missed memory barrier would imply that there's (at > least) one process that's waiting but is not in the lock's wait queue and Do I have to also check the wait queue to verify? Give a hint/pointer please? > > Would you suggest how I can maximize the likelyhood/speed of triggering that ? > > Five years ago, with a report of similar symptoms, you said "You need to hack > > pgbench to suppress the single initialization connection it normally likes to > > make, else the test degenerates to the one-incoming-connection case" > > https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us > > I don't think that case was related at all. > > My theory suggests that any contended use of an LWLock is at risk, > in which case just running pgbench with about as many sessions as > you have in the live server ought to be able to trigger it. However, > that doesn't really account for your having observed the problem > only during session startup, Remember, this issue breaks existing sessions, too. After it'd happened a couple times, I made sure to leave a session opened to allow collecting diagnostics (and try to un-stick it), but when it recurred, was unable to even SELECT * FROM pg_locks. Am I right this won't help for lwlocks? ALTER SYSTEM SET log_lock_waits=yes Justin