I'm sorry to report that this previously reported problem is happening again, starting shortly after pg_upgrading a customer to PG10.1 from 9.6.5. As $subject: backends are stuck in "startup" for minutes at a time. I didn't strace them this time, but I believe the last time I looked, one was waiting on a futex.

postgres 25271 17665  0 16:54 ?  00:00:00 postgres: oooo oooo a.b.c.246(59980) startup
postgres 27123 17665  0 16:55 ?  00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59919) startup
postgres 27142 17665  0 16:55 ?  00:00:00 postgres: main main a.b.c.100(42892) startup
postgres 27224 17665  0 16:56 ?  00:00:00 postgres: main main a.b.c.100(42898) startup
postgres 27270 17665  0 16:56 ?  00:00:00 postgres: wwwwww wwwwww a.b.c.252(39760) startup
postgres 27271 17665  0 16:56 ?  00:00:00 postgres: iiiiiii iiiiiii a.b.c.248(37224) startup
postgres 27873 17665  0 16:58 ?  00:00:00 postgres: wwwwwwww wwwwwwww a.b.c.253(59667) startup
postgres 30038 17665  0 17:00 ?  00:00:00 postgres: wwwwww wwwwww a.b.c.252(39780) startup
postgres 30039 17665  0 17:00 ?  00:00:00 postgres: oooo oooo a.b.c.246(60014) startup
postgres 30040 17665  0 17:00 ?  00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59951) startup
postgres 30041 17665  0 17:00 ?  00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59952) startup
postgres 30042 17665  0 17:00 ?  00:00:00 postgres: wwwwww wwwwww a.b.c.252(39788) startup
postgres 30043 17665  0 17:00 ?  00:00:00 postgres: aaaaa aaaaa a.b.c.245(44819) startup
postgres 30044 17665  0 17:00 ?  00:00:00 postgres: nnn nnn a.b.c.250(41317) startup
postgres 30045 17665  0 17:00 ?  00:00:00 postgres: mmmmmm mmmmmm a.b.c.249(47876) startup
postgres 30046 17665  0 17:00 ?  00:00:00 postgres: nnn nnn a.b.c.250(41320) startup
postgres 30048 17665  0 17:00 ?  00:00:00 postgres: aaaaa aaaaa a.b.c.245(44834) startup
postgres 30051 17665  0 17:00 ?  00:00:00 postgres: iiiiiii iiiiiii a.b.c.248(37244) startup
postgres 30052 17665  0 17:00 ?  00:00:00 postgres: iiiiiii iiiiiii a.b.c.248(37247) startup
postgres 30054 17665  0 17:00 ?  00:00:00 postgres: nnn nnn a.b.c.250(41327) startup
postgres 30079 17665  0 17:00 ?  00:00:00 postgres: aaaaa aaaaa a.b.c.245(44843) startup
postgres 30081 17665  0 17:00 ?  00:00:00 postgres: wwwwww wwwwww a.b.c.252(39790) startup
postgres 30083 17665  0 17:00 ?  00:00:00 postgres: pppp pppp a.b.c.251(50397) startup
postgres 30084 17665  0 17:00 ?  00:00:00 postgres: main main a.b.c.100(43074) startup
postgres 30604 17665  0 17:02 ?  00:00:00 postgres: oooo oooo a.b.c.246(60018) startup
postgres 30614 17665  0 17:02 ?  00:00:00 postgres: nnn nnn a.b.c.250(41331) startup
postgres 30624 17665  0 17:02 ?  00:00:00 postgres: aaaaa aaaaa a.b.c.245(44847) startup
postgres 30841 17665  0 17:03 ?  00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59961) startup
postgres 31023 17665  0 17:03 ?  00:00:00 postgres: postgres main [local] startup
postgres 31024 17665  0 17:03 ?  00:00:00 postgres: wwwwww wwwwww a.b.c.252(39796) startup

I found these older, similar reports:

https://www.postgresql.org/message-id/flat/4F9AC452.3050102%40pgexperts.com#4F9AC452.3050102@xxxxxxxxxxxxx
https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us

I saved ~40 cores from backends from the most recent incident, all of which are essentially identical:

(gdb) bt
#0  0x0000003f6ee0d930 in ?? ()
#1  0x00007f19f3401258 in ?? ()
#2  0x00000000006a4440 in PGSemaphoreLock (sema=0x7f19f3401258) at pg_sema.c:310
#3  0x0000000000714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) at lwlock.c:1233
#4  0x00000000006f75e5 in BufferAlloc (smgr=0x1f3bc48, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "") at bufmgr.c:1012
#5  ReadBuffer_common (smgr=0x1f3bc48, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "") at bufmgr.c:745
#6  0x00000000006f800e in ReadBufferExtended (reln=0x1ee6a08, forkNum=MAIN_FORKNUM, blockNum=1, mode=<value optimized out>, strategy=<value optimized out>) at bufmgr.c:664
#7  0x00000000004c090a in _bt_relandgetbuf (rel=0x1ee6a08, obuf=145, blkno=<value optimized out>, access=1) at nbtpage.c:708
#8  0x00000000004c1284 in _bt_getroot (rel=0x1ee6a08, access=<value optimized out>) at nbtpage.c:298
#9  0x00000000004c46eb in _bt_search (rel=0x1ee6a08, keysz=1, scankey=0x7fff6ef1d040, nextkey=0 '\000', bufP=0x7fff6ef1da4c, access=1, snapshot=0x1f1a648) at nbtsearch.c:103
#10 0x00000000004c5d1d in _bt_first (scan=0x1f1a4a8, dir=ForwardScanDirection) at nbtsearch.c:1028
#11 0x00000000004c39dd in btgettuple (scan=0x1f1a4a8, dir=ForwardScanDirection) at nbtree.c:372
#12 0x00000000004bcf1a in index_getnext_tid (scan=0x1f1a4a8, direction=<value optimized out>) at indexam.c:541
#13 0x00000000004bcfca in index_getnext (scan=0x1f1a4a8, direction=ForwardScanDirection) at indexam.c:679
#14 0x00000000004bc549 in systable_getnext (sysscan=0x1f1ae68) at genam.c:422
#15 0x000000000080fac7 in SearchCatCache (cache=0x1ece848, v1=<value optimized out>, v2=<value optimized out>, v3=<value optimized out>, v4=<value optimized out>) at catcache.c:1178
#16 0x0000000000844e7d in superuser_arg (roleid=33320) at superuser.c:71
#17 0x0000000000832917 in InitPostgres (in_dbname=0x1e9eec8 "iiiiiii", dboid=0, username=0x1e71738 "iiiiiii", useroid=0, out_dbname=0x0) at postinit.c:742
#18 0x00000000007239f2 in PostgresMain (argc=1, argv=<value optimized out>, dbname=0x1e9eec8 "iiiiiii", username=0x1e71738 "iiiiiii") at postgres.c:3735
#19 0x00000000006b7d7a in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4357
#20 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4029
#21 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1753
#22 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1361
#23 0x0000000000638ad0 in main (argc=3, argv=0x1e6f5b0) at main.c:228

(gdb) p *reln
$2 = {rd_node = {spcNode = 1663, dbNode = 16405, relNode = 2679}...

nep=# SELECT relname FROM pg_class WHERE oid=2679;
relname | pg_index_indexrelid_index

For the largest DB:

main=# \dt+ pg_class|pg_attribute|pg_index
   Schema   |     Name     | Type  |  Owner   |  Size   | Description
------------+--------------+-------+----------+---------+-------------
 pg_catalog | pg_attribute | table | postgres | 227 MB  |
 pg_catalog | pg_class     | table | postgres | 6752 kB |
 pg_catalog | pg_index     | table | postgres | 3448 kB |

During normal operation, after restarting following this incident:

postgres=# SELECT COUNT(1) FROM pg_stat_activity;
count | 85

shared_buffers                  | 2GB
max_connections                 | 400
effective_cache_size            | 4GB
maintenance_work_mem            | 512MB
effective_io_concurrency        | 2
max_locks_per_transaction       | 64
max_parallel_workers_per_gather | 3
password_encryption             | scram-sha-256

I can probably collect other diagnostics if needed, but otherwise it's possible I'll need to reload this DB under PG9.6.
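If it would help, during the next incident I could try to capture wait events from an already-open superuser session, along these lines (just a sketch; wait_event/wait_event_type are the stock pg_stat_activity columns in v10):

  -- what each stuck backend is waiting on, oldest connection first
  SELECT pid, backend_type, state, wait_event_type, wait_event,
         now() - backend_start AS connected_for
    FROM pg_stat_activity
   WHERE wait_event IS NOT NULL
   ORDER BY backend_start;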
..but, while this issue is occurring, I can't run even simple queries (like SELECT * FROM pg_locks).

Notes:
 - we DO use temp tables; it looks like we created up to 90 temp tables each hour;
 - we're still running ANALYZE following yesterday's pg_upgrade;
 - I have a script saving snapshots from pg_buffercache, but it doesn't appear to have been running during this last incident (the query is sketched in the P.S. below);
 - I'm planning to disable sync seqscans for the night (although, as I understand it, that may not matter (?)); the command is also in the P.S.;
 - this instance has clients connecting to ~10 DBs.

Thanks in advance,
Justin
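P.S. For reference, the pg_buffercache script saves the output of a query along these lines (a sketch, not the exact script; it only uses the stock pg_buffercache columns):

  -- per-relation shared buffer usage in the current database
  SELECT c.relname,
         count(*) AS buffers,
         sum(b.pinning_backends) AS pinned
    FROM pg_buffercache b
    JOIN pg_class c
      ON b.relfilenode = pg_relation_filenode(c.oid)
     AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                               WHERE datname = current_database()))
   GROUP BY c.relname
   ORDER BY 2 DESC
   LIMIT 20;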
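And unless someone says it's pointless, the plan for disabling sync seqscans tonight is simply the following (ALTER SYSTEM needs superuser; synchronize_seqscans takes effect on reload, no restart needed):

  ALTER SYSTEM SET synchronize_seqscans = off;
  SELECT pg_reload_conf();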