Hi,

The undetected deadlock occurred again today and I was able to collect some more info. The presentation was very similar to the case I reported previously:

- One backend trying to do a DROP TABLE on a partition of a partitioned table, waiting for an AccessExclusiveLock on that table
- Another backend trying to do a DELETE on records within a partition of another table, waiting on an AccessShareLock which is already held by the first backend, while itself already holding a lock on the table that the first backend is waiting for
- A load of other backends also trying to do a DELETE.

I was able to attach gdb to the backends and I discovered a few things. First, I took a stack trace of the process doing the DELETE; it looks like this:

#0  0x00007fc82a6750bb in epoll_wait (epfd=13, events=0x18696f8, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00000000007d0c32 in WaitEventSetWaitBlock (nevents=1, occurred_events=<optimized out>, cur_timeout=-1, set=0x1869698) at latch.c:1450
#2  WaitEventSetWait (set=0x1869698, timeout=<optimized out>, occurred_events=<optimized out>, nevents=1, wait_event_info=<optimized out>) at latch.c:1396
#3  0x00000000007d0f94 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=0, wait_event_info=50331648) at latch.c:473
#4  0x00000000007eaea7 in ProcSleep (locallock=locallock@entry=0x24571d0, lockMethodTable=lockMethodTable@entry=0xdb2360 <default_lockmethod>) at proc.c:1361
#5  0x00000000007df419 in WaitOnLock (locallock=locallock@entry=0x24571d0, owner=owner@entry=0x18b80f8) at lock.c:1858
#6  0x00000000007e052c in LockAcquireExtended (locktag=locktag@entry=0x7ffced1d0b80, lockmode=lockmode@entry=1, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x7ffced1d0b78) at lock.c:1100
#7  0x00000000007ddb23 in LockRelationOid (relid=1842227607, lockmode=1) at lmgr.c:117
#8  0x000000000050dba5 in relation_open (relationId=1842227607, lockmode=lockmode@entry=1) at relation.c:56
#9  0x00000000008fc838 in generate_partition_qual (rel=0x56614e8) at partcache.c:361
#10 generate_partition_qual (rel=0x56614e8) at partcache.c:336
#11 0x00000000007502bf in set_baserel_partition_constraint (relation=relation@entry=0x56614e8, rel=<optimized out>, rel=<optimized out>) at plancat.c:2402
~snip~

The lock is being taken out by the function generate_partition_qual (in partcache.c), which has this comment:

/*
 * Grab at least an AccessShareLock on the parent table.  Must do this
 * even if the partition has been partially detached, because transactions
 * concurrent with the detach might still be trying to use a partition
 * descriptor that includes it.
 */

This is happening during query planning rather than execution, while fetching the partition bound expression. That explains why this lock is not always required: the result is cached in the relcache, and anyway it looks to me like whether this code path is reached depends on factors such as the row estimates.

The second thing was that the DROP process, and all the DELETE processes, were waiting inside epoll_wait, ultimately called from ProcSleep as shown above. They are sitting in this section of code (proc.c line 1361):

    (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,    <--- processes sitting here
                     PG_WAIT_LOCK | locallock->tag.lock.locktag_type);
    ResetLatch(MyLatch);

    /* check for deadlocks first, as that's probably log-worthy */
    if (got_deadlock_timeout)                                           <--- set breakpoint here
    {
        CheckDeadLock();
        got_deadlock_timeout = false;
    }

I set a breakpoint on the next line after ResetLatch is called, and it was never reached. I thought that could be due to GDB attaching to the process and disrupting the alarm signal that should invoke the timeout handler, so I manually sent the process a SIGALRM.
At that point the breakpoint was triggered - but the flag got_deadlock_timeout was not set, so the deadlock check was not executed. This led me to believe that the deadlock-check timeout was never scheduled at all.

To see whether the deadlock would be detected if CheckDeadLock() got an opportunity to run, I manually set got_deadlock_timeout, and sure enough the deadlock was detected and released.

So I concluded that the deadlock detection function is able to detect these deadlocks, but for some reason the 1s deadlock_timeout is not being correctly scheduled, so the check never runs. When it is triggered artificially, it does release the deadlock.

After the release, a new deadlock formed between the DROP process and another of the waiting DELETE processes (not an entirely unexpected outcome). I will leave things in that state for as long as possible in case anyone can think of any other info that should be gathered - luckily it is a test instance.

Sorry for the slightly long and detailed email - let me know if this should be moved to another forum.

Cheers
Mike