Justin Pryzby <pryzby@xxxxxxxxxxxxx> writes:
> On Wed, Nov 22, 2017 at 07:43:50PM -0500, Tom Lane wrote:
>> My hypothesis about a missed memory barrier would imply that there's (at
>> least) one process that's waiting but is not in the lock's wait queue and

> Do I have to also check the wait queue to verify?  Give a hint/pointer please?

Andres probably knows more about this data structure than I do, but I
believe that the values in the LWLock's proclist_head field are indexes
into the PGProc array, and that the PGProc.lwWaitLink proclist_node fields
contain the fore and aft pointers in a doubly-linked list of waiting
processes.  But chasing through that by hand is going to be darn tedious
if there are a bunch of processes queued for the same lock.  In any case,
if the process is blocked right there and its lwWaiting field is not set,
that is sufficient proof of a bug IMO.  What is not quite proven yet is
why it failed to detect that it'd been woken.

I think really the most useful thing at this point is just to wait and
see if your SYSV-semaphore build exhibits the same problem or not.
If it does not, we can be pretty confident that *something* is wrong
with the POSIX-semaphore code, even if my current theory isn't it.

>> My theory suggests that any contended use of an LWLock is at risk,
>> in which case just running pgbench with about as many sessions as
>> you have in the live server ought to be able to trigger it.  However,
>> that doesn't really account for your having observed the problem
>> only during session startup,

> Remember, this issue breaks existing sessions, too.

Well, once one session is hung up, anything else that came along wanting
access to that same LWLock would also get stuck.  Since the lock in
question is a buffer partition lock controlling access to something like
1/128'th of the shared buffer pool, it would not take too long for every
active session to get stuck there, whether it were doing anything related
or not.

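For readers who want a picture of the wait-queue layout described above,
here is a toy Python model. It is not PostgreSQL code. It only mirrors the
idea of a head/tail pair of array indexes plus per-process fore and aft
links, and the class names, the INVALID sentinel, and the looks_lost()
helper are all assumptions made up for illustration.

```python
from dataclasses import dataclass

INVALID = -1  # hypothetical "no process" sentinel, chosen for this toy model only


@dataclass
class Proc:
    """Stand-in for one process-array entry: a waiting flag plus list links."""
    pid: int
    lw_waiting: bool = False
    next: int = INVALID  # index of the following waiter
    prev: int = INVALID  # index of the preceding waiter


@dataclass
class WaitQueue:
    """Stand-in for the lock's list head: indexes into the process array."""
    head: int = INVALID
    tail: int = INVALID


def enqueue(procs, queue, idx):
    """Append procs[idx] at the tail of the queue and flag it as waiting."""
    node = procs[idx]
    node.lw_waiting = True
    node.next = INVALID
    node.prev = queue.tail
    if queue.tail == INVALID:
        queue.head = idx
    else:
        procs[queue.tail].next = idx
    queue.tail = idx


def waiters(procs, queue):
    """Follow the forward links from the head; return the queued indexes."""
    found = []
    idx = queue.head
    while idx != INVALID:
        found.append(idx)
        idx = procs[idx].next
    return found


def looks_lost(procs, queue, idx):
    """For a process already known to be blocked on this lock: True if it is
    not flagged as waiting or cannot be reached from the queue head."""
    return (not procs[idx].lw_waiting) or idx not in waiters(procs, queue)


procs = [Proc(pid=1000 + i) for i in range(4)]
queue = WaitQueue()
enqueue(procs, queue, 2)
enqueue(procs, queue, 0)
print(waiters(procs, queue))  # [2, 0]
```

Walking the links by program rather than by hand is the point: with many
processes queued on one lock, the manual chase is the tedious part.
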
In any case, if you feel like trying the pgbench approach, I'd suggest
setting up a script to run a lot of relatively short runs rather than
one long one.  If there is something magic about the first blockage
in a session, that would help catch it.

> Am I right this won't help for lwlocks?
> ALTER SYSTEM SET log_lock_waits=yes

Nope, that's just for heavyweight locks.  LWLocks are lightweight
precisely because they don't have stuff like logging, timeouts,
or deadlock detection ...

			regards, tom lane
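
One way the "lot of relatively short runs" suggestion above might be
scripted is sketched below in Python. The helper names, the client count,
the run length, and the timeout are all assumptions, and the sketch
presumes a database that has already been initialized with pgbench -i.
Only the standard pgbench options -c (clients) and -T (duration in
seconds) are used. A run that exceeds the timeout is treated as a possible
hang, and the loop stops there so the server can be inspected.

```python
import subprocess


def pgbench_argv(clients, seconds, dbname):
    """Build the command line for one short pgbench run."""
    return ["pgbench", "-c", str(clients), "-T", str(seconds), dbname]


def run_short_runs(argv, runs, timeout, runner=subprocess.run):
    """Run argv up to `runs` times, each time with fresh sessions.

    Returns the index of the first run that exceeded `timeout` seconds,
    or None if every run finished in time. `runner` is injectable so the
    loop can be exercised without a live server.
    """
    for i in range(runs):
        try:
            runner(argv, timeout=timeout, check=False)
        except subprocess.TimeoutExpired:
            # Possible hang: stop here rather than piling on more sessions.
            return i
    return None
```

For example, run_short_runs(pgbench_argv(50, 10, "bench"), runs=1000,
timeout=60) would start a thousand ten-second runs of fifty clients each
against a hypothetical database named bench, giving up on the first run
that takes longer than a minute.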