Search Postgresql Archives

Re: backends stuck in "startup"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Nov 21, 2017 at 03:40:27PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2017-11-21 17:09:26 -0600, Justin Pryzby wrote:
> > I'm sorry to report this previously reported problem is happening again,
> > starting shortly after pg_upgrading a customer to PG10.1 from 9.6.5.
> > 
> > As $subject: backends are stuck in startup for minutes at a time.  I didn't
> > strace this time, but I believe last time I saw one was waiting in a futex.
> 
> Could you try stracing next time?

I straced all the "startup" PIDs, which were all in futex, without exception:

[pryzbyj@TS-DB ~]$ sudo strace `ps h --sort start_time -o pid,args -u postgres |awk '/startup$/{print "-p"$1}'` 2>&1 |tee -a /tmp/strace-pg-backends

Process 3611 attached
Process 6081 attached
[pid  6081] futex(0x7ff4314000f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  3611] futex(0x7ff431400138, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  3354] futex(0x7ff431400258, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  3117] futex(0x7ff4314002b8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2948] futex(0x7ff431400378, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2925] futex(0x7ff4314003d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2789] futex(0x7ff431400438, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2752] futex(0x7ff4314004b8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2663] futex(0x7ff431400618, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2661] futex(0x7ff431400638, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2624] futex(0x7ff431400678, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2576] futex(0x7ff4314006f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid   365] futex(0x7ff431400718, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid   326] futex(0x7ff431400798, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32704] futex(0x7ff431400898, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32309] futex(0x7ff431400938, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32292] futex(0x7ff431400a18, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32170] futex(0x7ff431400ad8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32134] futex(0x7ff431400b58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31987] futex(0x7ff431400bf8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31962] futex(0x7ff431400c58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31817] futex(0x7ff431400d38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31807] futex(0x7ff431400d98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29664] futex(0x7ff431400e98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29462] futex(0x7ff431400ef8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29426] futex(0x7ff431400f38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29258] futex(0x7ff431401058, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29227] futex(0x7ff431402f58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29107] futex(0x7ff431401138, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29071] futex(0x7ff431401198, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 28942] futex(0x7ff431401258, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 28924] futex(0x7ff431401298, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 26942] futex(0x7ff431401418, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25278] futex(0x7ff431401438, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24909] futex(0x7ff431401598, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24908] futex(0x7ff4314015b8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24852] futex(0x7ff4314015f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24851] futex(0x7ff431401618, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24585] futex(0x7ff4314016f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24584] futex(0x7ff431401718, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24478] futex(0x7ff431401918, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24476] futex(0x7ff431401958, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24470] futex(0x7ff4314019d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24464] futex(0x7ff431401ad8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24458] futex(0x7ff431401bb8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24456] futex(0x7ff431401b98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24450] futex(0x7ff431401c58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24448] futex(0x7ff431401c98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24439] futex(0x7ff431401cf8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24323] futex(0x7ff431401e38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24322] futex(0x7ff431401e58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23057] futex(0x7ff431401fd8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23055] futex(0x7ff431402038, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21826] futex(0x7ff4314021f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21824] futex(0x7ff431402238, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21794] futex(0x7ff431402298, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21792] futex(0x7ff431402638, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21791] futex(0x7ff431402af8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21790] futex(0x7ff431402c98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21786] futex(0x7ff431402478, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21785] futex(0x7ff431402758, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21784] futex(0x7ff431403098, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21768] futex(0x7ff431402718, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21662] futex(0x7ff431402338, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21648] futex(0x7ff431402c38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21630] futex(0x7ff4314024d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21591] futex(0x7ff431402278, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21590] futex(0x7ff431402bf8, FUTEX_WAIT, 0, NULL

Also:

[pryzbyj@TS-DB ~]$ date;ps -O lstart,wchan=9999999999999999999999 -u postgres --sort start_time |grep startup$
Wed Nov 22 00:47:41 EST 2017
 1990 Wed Nov 22 00:36:44 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.247(53934) startup
 2007 Wed Nov 22 00:36:44 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.247(53945) startup
 2014 Wed Nov 22 00:36:44 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.247(53946) startup
 2042 Wed Nov 22 00:36:48 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.248(51353) startup
 2043 Wed Nov 22 00:36:48 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.252(52122) startup
 2047 Wed Nov 22 00:36:49 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.253(46696) startup
 2049 Wed Nov 22 00:36:49 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.249(60135) startup
 2052 Wed Nov 22 00:36:50 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.247(53947) startup
 2095 Wed Nov 22 00:37:13 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.248(51355) startup
 2096 Wed Nov 22 00:37:14 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.249(60137) startup
 2097 Wed Nov 22 00:37:15 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.247(53949) startup
 2178 Wed Nov 22 00:37:43 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.253(46698) startup
 2188 Wed Nov 22 00:38:00 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.252(52124) startup
 2331 Wed Nov 22 00:38:30 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.248(51357) startup
 2332 Wed Nov 22 00:38:30 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.249(60139) startup
 2333 Wed Nov 22 00:38:31 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.247(53951) startup
 2529 Wed Nov 22 00:39:00 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.253(46700) startup
 2530 Wed Nov 22 00:39:01 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.248(51358) startup
 2533 Wed Nov 22 00:39:01 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.247(53952) startup
 2535 Wed Nov 22 00:39:01 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.252(52125) startup
 2549 Wed Nov 22 00:39:01 2017 futex_wait_queue_me    S ?        00:00:00 postgres: xx xx xx.253(46701) startup
[...]

> Don't think it's conclusive that they're all continually waiting rather than
> contending on a few locks.  As soon as you attach to one backend a contended
> lock situation can quickly turn into standstill if you end up interrupting a
> process that holds a lock.

I didn't understand what you mean.

By "attach" do you mean attach with strace?  gdb?  Or ??

I also tried stracing all (400) postgres PIDs, which didn't magically unlock them..
sudo strace `ps h --sort start_time -o pid,args -u postgres |awk '/|startup$/{print "-p"$1}'` 2>> /tmp/strace-pg-backends3 &

FYI these AREN'T being rebuilt very often:

[pryzbyj@TS-DB ~]$ sudo sh -c "find /var/lib/pgsql/ -name 'pg_internal.init' -print0 |xargs -r0 ls -lt"
-rw------- 1 postgres postgres 136164 Nov 21 21:50 /var/lib/pgsql/10/data/base/16404/pg_internal.init
-rw------- 1 postgres postgres  21340 Nov 21 21:50 /var/lib/pgsql/10/data/global/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:47 /var/lib/pgsql/10/data/base/16406/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:32 /var/lib/pgsql/10/data/base/16410/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:28 /var/lib/pgsql/10/data/base/16401/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:21 /var/lib/pgsql/10/data/base/16403/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:13 /var/lib/pgsql/10/data/base/13451/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:13 /var/lib/pgsql/10/data/base/16408/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:59 /var/lib/pgsql/10/data/base/16411/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:56 /var/lib/pgsql/10/data/base/16402/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:01 /var/lib/pgsql/10/data/base/1/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:01 /var/lib/pgsql/10/data/base/16409/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:00 /var/lib/pgsql/10/data/base/16407/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:00 /var/lib/pgsql/10/data/base/16405/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 19:59 /var/lib/pgsql/10/data/base/16400/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 17:17 /var/lib/pgsql/10/data/base/13450/pg_internal.init
[pryzbyj@TS-DB ~]$ date
Wed Nov 22 00:45:14 EST 2017

And it's not a minor overhead:

[pryzbyj@TS-DB ~]$ time psql newn -c 'SELECT 1'
psql: FATAL:  terminating connection due to administrator command
real    10m22.089s

Justin




[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux