Re: backends stuck in "startup"

Justin Pryzby <pryzby@xxxxxxxxxxxxx> · Wed, 22 Nov 2017 19:24:50 -0600

On Wed, Nov 22, 2017 at 07:43:50PM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby@xxxxxxxxxxxxx> writes:
> > For starters, I found that PID 27427 has:
> 
> > (gdb) p proc->lwWaiting
> > $1 = 0 '\000'
> > (gdb) p proc->lwWaitMode
> > $2 = 1 '\001'
> 
> To confirm, this is LWLockAcquire's "proc", equal to MyProc?
> If so, and if LWLockAcquire is blocked at PGSemaphoreLock,
> that sure seems like a smoking gun.

Right:

(gdb) bt
#0  0x0000003f6ee0d930 in ?? ()
#1  0x00007f19f3400d38 in ?? ()
#2  0x00000000006a4440 in PGSemaphoreLock (sema=0x7f19f3400d38) at pg_sema.c:310
#3  0x0000000000714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) at lwlock.c:1233
#4  0x00000000006f75e5 in BufferAlloc (smgr=0x1f3c368, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "") at bufmgr.c:1012
#5  ReadBuffer_common (smgr=0x1f3c368, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "") at bufmgr.c:745
[...]

(gdb) up
#1  0x00007f19f3400d38 in ?? ()
(gdb) up
#2  0x00000000006a4440 in PGSemaphoreLock (sema=0x7f19f3400d38) at pg_sema.c:310
310                     errStatus = sem_wait(PG_SEM_REF(sema));
(gdb) up
#3  0x0000000000714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) at lwlock.c:1233
1233                            PGSemaphoreLock(proc->sem);
(gdb) p proc->lwWaiting
$1 = 0 '\000'
(gdb) p proc->lwWaitMode
$2 = 1 '\001'
(gdb) p proc
$3 = (PGPROC *) 0x7f1a77dba500
(gdb) p MyProc
$4 = (PGPROC *) 0x7f1a77dba500
(gdb) p MyProc==proc
$5 = 1
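
For anyone following along, here is a toy model of why that combination looks wrong. It is a minimal sketch in Python, not the PostgreSQL source: `Waiter`, `wait_for_wakeup`, `wake` and the two driver functions are hypothetical names, and the model assumes the usual flag-plus-semaphore protocol in which the waker clears the flag and then posts the semaphore. Under that assumption, a waiter whose flag is already clear but which is still blocked on the semaphore can only proceed if a post eventually arrives.

```python
import threading


class Waiter:
    """Hypothetical stand-in for per-backend wait state (not PGPROC)."""

    def __init__(self):
        self.lw_waiting = False
        self.sem = threading.Semaphore(0)
        self.extra_waits = 0


def wait_for_wakeup(w):
    # Sleep on the semaphore until the flag has been cleared; wakeups that
    # arrive while the flag is still set are counted and the wait repeats.
    while True:
        w.sem.acquire()
        if not w.lw_waiting:
            break
        w.extra_waits += 1


def wake(w):
    # Assumed waker protocol: clear the flag first, then post the semaphore.
    w.lw_waiting = False
    w.sem.release()


def normal_wakeup():
    """Flag cleared and semaphore posted: the waiter finishes."""
    w = Waiter()
    w.lw_waiting = True
    t = threading.Thread(target=wait_for_wakeup, args=(w,), daemon=True)
    t.start()
    wake(w)
    t.join(timeout=5)
    return not t.is_alive()


def flag_cleared_but_no_post(timeout=0.2):
    """Flag cleared without a post: the waiter stays blocked."""
    w = Waiter()
    w.lw_waiting = True
    t = threading.Thread(target=wait_for_wakeup, args=(w,), daemon=True)
    t.start()
    w.lw_waiting = False      # flag is clear ...
    t.join(timeout)           # ... but nothing posts the semaphore
    still_blocked = t.is_alive()
    w.sem.release()           # clean up so the thread can exit
    t.join()
    return still_blocked
```

In this model the second case reproduces the shape of what gdb shows above (flag clear, thread parked in the semaphore wait). It says nothing about how the real backend got there.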

I suppose one also needs to show that this backend was really stuck, and had
not "just happened to start 1ms before gdb connected".  I do have that:
postgres 27427 17665  0 16:57 ?        00:00:00 postgres: indigo indigo 207.241.169.248(37226) startup
...
postgres 30298 17665  0 17:01 ?        00:00:00 postgres: main main 207.241.169.100(43088) startup    

So that process was still in "startup" 3-4 minutes after being forked.
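
The same check can be scripted. This is a minimal sketch, assuming a `ps` that supports the `etimes` (elapsed seconds) column and that not-yet-authenticated backends show a process title ending in "startup", as in the listing above. The function names and the 60-second threshold are illustrative and not from this thread.

```python
import subprocess


def stuck_startup_backends(ps_output, min_seconds=60):
    """Return (pid, elapsed_seconds) for postgres backends whose title
    ends in "startup" and that have existed for at least min_seconds."""
    stuck = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, etimes, full command line
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # header or malformed line
        pid, elapsed, args = int(parts[0]), int(parts[1]), parts[2].rstrip()
        if args.startswith("postgres:") and args.endswith(" startup"):
            if elapsed >= min_seconds:
                stuck.append((pid, elapsed))
    return stuck


def scan_live_system(min_seconds=60):
    # Assumption: procps-ng style `ps`; older versions may lack `etimes`.
    out = subprocess.run(["ps", "-eo", "pid,etimes,args"],
                         capture_output=True, text=True, check=True).stdout
    return stuck_startup_backends(out, min_seconds)
```

Calling `scan_live_system()` periodically (from cron, say) would give a list of PIDs worth attaching gdb to before anyone restarts the server.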

> My hypothesis about a missed memory barrier would imply that there's (at
> least) one process that's waiting but is not in the lock's wait queue and

Do I also have to check the wait queue to verify that?  If so, could you give
me a hint or pointer on how?

> > Would you suggest how I can maximize the likelihood/speed of triggering that?
> > Five years ago, with a report of similar symptoms, you said "You need to hack
> > pgbench to suppress the single initialization connection it normally likes to
> > make, else the test degenerates to the one-incoming-connection case"
> > https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us
> 
> I don't think that case was related at all.
> 
> My theory suggests that any contended use of an LWLock is at risk,
> in which case just running pgbench with about as many sessions as
> you have in the live server ought to be able to trigger it.  However,
> that doesn't really account for your having observed the problem
> only during session startup,

Remember, this issue breaks existing sessions, too.

After it had happened a couple of times, I made sure to leave a session open
so I could collect diagnostics (and try to un-stick it), but when it recurred,
I was unable to even run SELECT * FROM pg_locks.

Am I right that this won't help for lwlocks?  ALTER SYSTEM SET log_lock_waits=yes

Justin