> 于 2013/8/20 12:34, Jeff Janes 写道:
>
> > How long had they been hanging there? It makes a big difference whether
> > there are several hanging there at one moment, but a few milliseconds later
> > there are several different ones, versus the same few that hang around of
> > many seconds or minutes at a time.
>
> The hanging connections never disappear until I restart pgbouncer. It's like
> this, At minute 1, 3 connections left. At minute 2, another 3 left, total 6.
> Another minute, another 3 left, total 9....till the limit reaches.
OK, that certainly does sound like network problems and not disk contention. But what I don't see is why it would be listed as "active" in pg_stat_activity. If it is blocked on a network connection, I would think it would show 'idle'.
>
> > If it is the same few processes hung for long periods, I would strace them,
> > or gdb them and get a backtrace.
>
> any detail guide to use strace/gdb on pg process?
strace and gdb are aggressive forms of monitoring, so you should talk to your sys admins before using them (if your organization has that type of segregation of duties) or read their documentation.
It is best to run them on a test system if you can get the problem to occur there. If you can't, then I would be willing to run them against a hung backend on a production system, but I might be more adventurous than most.
I have occasionally seen strace cause the straced program to seg-fault, but I've only seen this with GUI programs and never with postgresql.
In theory attaching gdb could make a backend pause while holding a spinlock, and if left that way for long enough could cause a postgresql-wide panic over a stuck spinlock. But I've never seen this happen without intentionally causing. If the backend you attach is truly hung and hasn't already causes a panic, then it almost surely can't be happen, but just to be sure you should quit gdb as soon as possible so that the backend can continue unimpeded.
Anyway, both are fairly easy to run once you find the pid of one of a stuck backend (either using ps, or using pg_stat_activity). Then you give the pid to the debugging program with -p option.
You probably have to run as the postgres user, or you won't have permissions to attach to the backend.
With strace, once you attach you will see a stream of system calls go to your screen, until you hit ctrl-C to detach. But if the backend is hung on a network connection, there should really only be one system call, and then it just wait until you detach or the network connection times out, like this:
$ strace -p 21116
Process 21116 attached - interrupt to quit
recvfrom(9,
So it is waiting on a recvfrom call, and that call never returns until I got sick of waiting and hit ctrl-C. Not very interesting, but it does show it is indeed stuck on the network
for gdb, it is similar to invoke:
$ gdb -p 21116
and it then produces several screenfuls of diagnostic gibberish and gives you an interactive command-line environment. Once attached, you want to get a backtrace ("bt", return), and then quit promptly ("q", return, "y").
That produces something like this:
Loaded symbols for /lib64/libnss_files.so.2
0x00000032a80e9672 in __libc_recv (fd=<value optimized out>, buf=0xb33f60, n=8192, flags=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:30
30 return INLINE_SYSCALL (recvfrom, 6, fd, buf, n, flags, NULL, NULL);
(gdb) bt
#0 0x00000032a80e9672 in __libc_recv (fd=<value optimized out>, buf=0xb33f60, n=8192, flags=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:30
#1 0x00000000005a4846 in secure_read (port=0x22f1190, ptr=0xb33f60, len=8192) at be-secure.c:304
#2 0x00000000005ae33b in pq_recvbuf () at pqcomm.c:824
#3 0x00000000005ae73b in pq_getbyte () at pqcomm.c:865
#4 0x0000000000651c11 in SocketBackend (argc=<value optimized out>, argv=<value optimized out>, dbname=0x22d2a28 "jjanes", username=<value optimized out>)
at postgres.c:342
#5 ReadCommand (argc=<value optimized out>, argv=<value optimized out>, dbname=0x22d2a28 "jjanes", username=<value optimized out>) at postgres.c:490
#6 PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x22d2a28 "jjanes", username=<value optimized out>) at postgres.c:3923
#7 0x000000000060e861 in BackendRun () at postmaster.c:3614
#8 BackendStartup () at postmaster.c:3304
#9 ServerLoop () at postmaster.c:1367
#10 0x00000000006113b1 in PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1127
#11 0x00000000005b0440 in main (argc=3, argv=0x22d1cb0) at main.c:199
(gdb) q
A debugging session is active.
Inferior 1 [process 21116] will be detached.
Quit anyway? (y or n) y
Detaching from program: /usr/local/pgsql9_2/bin/postgres, process 21116
Cheers,
Jeff