backends stuck in "startup"

Justin Pryzby <pryzby@xxxxxxxxxxxxx> · Tue, 21 Nov 2017 17:09:26 -0600

I'm sorry to report this previously reported problem is happening again,
starting shortly after pg_upgrading a customer to PG10.1 from 9.6.5.

As $subject: backends are stuck in startup for minutes at a time.  I didn't
strace this time, but I believe last time I saw one was waiting in a futex.

postgres 25271 17665  0 16:54 ?        00:00:00 postgres: oooo oooo a.b.c.246(59980) startup    
postgres 27123 17665  0 16:55 ?        00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59919) startup
postgres 27142 17665  0 16:55 ?        00:00:00 postgres: main main a.b.c.100(42892) startup    
postgres 27224 17665  0 16:56 ?        00:00:00 postgres: main main a.b.c.100(42898) startup    
postgres 27270 17665  0 16:56 ?        00:00:00 postgres: wwwwww wwwwww a.b.c.252(39760) startup
postgres 27271 17665  0 16:56 ?        00:00:00 postgres: iiiiiii iiiiiii a.b.c.248(37224) startup
postgres 27873 17665  0 16:58 ?        00:00:00 postgres: wwwwwwww wwwwwwww a.b.c.253(59667) startup
postgres 30038 17665  0 17:00 ?        00:00:00 postgres: wwwwww wwwwww a.b.c.252(39780) startup
postgres 30039 17665  0 17:00 ?        00:00:00 postgres: oooo oooo a.b.c.246(60014) startup    
postgres 30040 17665  0 17:00 ?        00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59951) startup
postgres 30041 17665  0 17:00 ?        00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59952) startup
postgres 30042 17665  0 17:00 ?        00:00:00 postgres: wwwwww wwwwww a.b.c.252(39788) startup
postgres 30043 17665  0 17:00 ?        00:00:00 postgres: aaaaa aaaaa a.b.c.245(44819) startup  
postgres 30044 17665  0 17:00 ?        00:00:00 postgres: nnn nnn a.b.c.250(41317) startup      
postgres 30045 17665  0 17:00 ?        00:00:00 postgres: mmmmmm mmmmmm a.b.c.249(47876) startup
postgres 30046 17665  0 17:00 ?        00:00:00 postgres: nnn nnn a.b.c.250(41320) startup      
postgres 30048 17665  0 17:00 ?        00:00:00 postgres: aaaaa aaaaa a.b.c.245(44834) startup  
postgres 30051 17665  0 17:00 ?        00:00:00 postgres: iiiiiii iiiiiii a.b.c.248(37244) startup
postgres 30052 17665  0 17:00 ?        00:00:00 postgres: iiiiiii iiiiiii a.b.c.248(37247) startup
postgres 30054 17665  0 17:00 ?        00:00:00 postgres: nnn nnn a.b.c.250(41327) startup      
postgres 30079 17665  0 17:00 ?        00:00:00 postgres: aaaaa aaaaa a.b.c.245(44843) startup  
postgres 30081 17665  0 17:00 ?        00:00:00 postgres: wwwwww wwwwww a.b.c.252(39790) startup
postgres 30083 17665  0 17:00 ?        00:00:00 postgres: pppp pppp a.b.c.251(50397) startup    
postgres 30084 17665  0 17:00 ?        00:00:00 postgres: main main a.b.c.100(43074) startup    
postgres 30604 17665  0 17:02 ?        00:00:00 postgres: oooo oooo a.b.c.246(60018) startup    
postgres 30614 17665  0 17:02 ?        00:00:00 postgres: nnn nnn a.b.c.250(41331) startup      
postgres 30624 17665  0 17:02 ?        00:00:00 postgres: aaaaa aaaaa a.b.c.245(44847) startup  
postgres 30841 17665  0 17:03 ?        00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59961) startup
postgres 31023 17665  0 17:03 ?        00:00:00 postgres: postgres main [local] startup               
postgres 31024 17665  0 17:03 ?        00:00:00 postgres: wwwwww wwwwww a.b.c.252(39796) startup

I found these older, similar reports:
https://www.postgresql.org/message-id/flat/4F9AC452.3050102%40pgexperts.com#4F9AC452.3050102@xxxxxxxxxxxxx
https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us

I saved ~40 cores from backends from the most recent incident, which are all
essentially identical:

(gdb) bt
#0  0x0000003f6ee0d930 in ?? ()
#1  0x00007f19f3401258 in ?? ()
#2  0x00000000006a4440 in PGSemaphoreLock (sema=0x7f19f3401258) at pg_sema.c:310
#3  0x0000000000714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) at lwlock.c:1233
#4  0x00000000006f75e5 in BufferAlloc (smgr=0x1f3bc48, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, 
    hit=0x7fff6ef1c5af "") at bufmgr.c:1012
#5  ReadBuffer_common (smgr=0x1f3bc48, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "")
    at bufmgr.c:745
#6  0x00000000006f800e in ReadBufferExtended (reln=0x1ee6a08, forkNum=MAIN_FORKNUM, blockNum=1, mode=<value optimized out>, 
    strategy=<value optimized out>) at bufmgr.c:664
#7  0x00000000004c090a in _bt_relandgetbuf (rel=0x1ee6a08, obuf=145, blkno=<value optimized out>, access=1) at nbtpage.c:708
#8  0x00000000004c1284 in _bt_getroot (rel=0x1ee6a08, access=<value optimized out>) at nbtpage.c:298
#9  0x00000000004c46eb in _bt_search (rel=0x1ee6a08, keysz=1, scankey=0x7fff6ef1d040, nextkey=0 '\000', bufP=0x7fff6ef1da4c, access=1, snapshot=0x1f1a648)
    at nbtsearch.c:103
#10 0x00000000004c5d1d in _bt_first (scan=0x1f1a4a8, dir=ForwardScanDirection) at nbtsearch.c:1028
#11 0x00000000004c39dd in btgettuple (scan=0x1f1a4a8, dir=ForwardScanDirection) at nbtree.c:372
#12 0x00000000004bcf1a in index_getnext_tid (scan=0x1f1a4a8, direction=<value optimized out>) at indexam.c:541
#13 0x00000000004bcfca in index_getnext (scan=0x1f1a4a8, direction=ForwardScanDirection) at indexam.c:679
#14 0x00000000004bc549 in systable_getnext (sysscan=0x1f1ae68) at genam.c:422
#15 0x000000000080fac7 in SearchCatCache (cache=0x1ece848, v1=<value optimized out>, v2=<value optimized out>, v3=<value optimized out>, 
    v4=<value optimized out>) at catcache.c:1178
#16 0x0000000000844e7d in superuser_arg (roleid=33320) at superuser.c:71
#17 0x0000000000832917 in InitPostgres (in_dbname=0x1e9eec8 "iiiiiii", dboid=0, username=0x1e71738 "iiiiiii", useroid=0, out_dbname=0x0) at postinit.c:742
#18 0x00000000007239f2 in PostgresMain (argc=1, argv=<value optimized out>, dbname=0x1e9eec8 "iiiiiii", username=0x1e71738 "iiiiiii") at postgres.c:3735
#19 0x00000000006b7d7a in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4357
#20 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4029
#21 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1753
#22 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1361
#23 0x0000000000638ad0 in main (argc=3, argv=0x1e6f5b0) at main.c:228

(gdb) p *reln
$2 = {rd_node = {spcNode = 1663, dbNode = 16405, relNode = 2679}...

nep=# SELECT relname FROM pg_class WHERE oid=2679;
relname | pg_index_indexrelid_index

For the largest DB:
main=# \dt+ pg_class|pg_attribute|pg_index
 pg_catalog | pg_attribute | table | postgres | 227 MB  | 
 pg_catalog | pg_class     | table | postgres | 6752 kB | 
 pg_catalog | pg_index     | table | postgres | 3448 kB | 

During normal operation, after restarting following this incident:

postgres=# SELECT COUNT(1) FROM pg_stat_activity;
count | 85

shared_buffers | 2GB
max_connections | 400
effective_cache_size | 4GB
maintenance_work_mem | 512MB
effective_io_concurrency | 2
max_locks_per_transaction | 64
max_parallel_workers_per_gather | 3
password_encryption | scram-sha-256

I can probably collect other diagnostics if needed, but otherwise it's possible
I'll need to reload this DB under PG96.

..but, while this issue is occuring, I can't make simple queries
(like SELECT * FROM pg_locks)

Notes:
 - we DO use temp tables; it looks like we created up to 90 temp tables each hour;
 - we're still running analyze following yesterday's pg_upgrade;
 - I have a script saving something from pg_buffercache, but it doesn't appear
   to be running during this last incident;
 - I'm planning do disable sync seqscan for the night (although as I understand
   that may not matter (?))
 - this instance has clients connecting to ~10 DBs;

Thanks in advance
Justin