1) A "ps" to check what processes were running. Many "postmaster" processes in defunct state. A few postgres connections still working. A few "psql" (by shell script) queries hanged (they are scripts to monitor state of database. We use big-sister to do so).
The normal state of the server is 1 postmaster pid and many "postgres" connections in "idle" state.
This the result when we were having the problem. I've excluded irrelevant processes.
postgres 892 3889 0 09:07 ? 00:01:05 postgres: postgres international 10.19.0.51(49264) idlepostgres 934 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>postgres 935 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>postgres 936 3889 0 12:00 ? 00:00:00 [postmaster] <defunct>postgres 938 3889 0 12:00 ? 00:00:01 [postmaster] <defunct>postgres 1244 3889 0 12:01 ? 00:00:01 [postmaster] <defunct>postgres 1453 3889 0 12:02 ? 00:00:02 [postmaster] <defunct>postgres 1563 3889 0 12:03 ? 00:00:10 [postmaster] <defunct>postgres 1862 3889 0 12:04 ? 00:00:04 [postmaster] <defunct>postgres 2275 3889 0 12:06 ? 00:00:02 [postmaster] <defunct>root 2287 49 0 Oct29 ? 00:00:00 [kondemand/0]root 2288 49 0 Oct29 ? 00:00:00 [kondemand/1]postgres 2485 3889 0 12:07 ? 00:00:05 postgres: postgres international 10.19.0.51(59066) idlepostgres 2490 3889 1 12:07 ? 00:00:26 postgres: postgres international 10.19.0.51(59070) idlepostgres 2568 3889 0 12:07 ? 00:00:00 postgres: postgres international 10.19.0.50(1249) idlepostgres 3544 3889 0 12:12 ? 00:00:07 [postmaster] <defunct>postgres 3875 3889 0 12:14 ? 00:00:03 [postmaster] <defunct>postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/datapostgres 3896 3889 0 Nov16 ? 00:00:00 postgres: logger processpostgres 3913 3889 0 Nov16 ? 00:00:05 postgres: writer processpostgres 3914 3889 0 Nov16 ? 00:00:00 postgres: wal writer processpostgres 3915 3889 0 Nov16 ? 00:00:01 postgres: autovacuum launcher processpostgres 3916 3889 0 Nov16 ? 00:01:42 postgres: stats collector processpostgres 4169 3889 0 12:15 ? 00:00:02 [postmaster] <defunct>postgres 4170 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>postgres 4171 3889 0 12:15 ? 00:00:07 [postmaster] <defunct>postgres 4281 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>postgres 4672 3889 0 12:17 ? 00:00:01 [postmaster] <defunct>postgres 5038 3889 0 12:19 ? 00:00:01 [postmaster] <defunct>postgres 5203 3889 0 12:20 ? 00:00:00 [postmaster] <defunct>postgres 5484 3889 0 12:21 ? 00:00:01 [postmaster] <defunct>postgres 5694 3889 0 12:22 ? 00:00:01 postgres: postgres international 10.19.0.49(29113) idlepostgres 5769 3889 0 12:23 ? 00:00:04 [postmaster] <defunct>postgres 5832 3889 0 12:23 ? 00:00:03 postgres: postgres international 10.19.0.49(29129) idlepostgres 5835 3889 0 12:23 ? 00:00:02 [postmaster] <defunct>postgres 5979 3889 0 12:24 ? 00:00:01 [postmaster] <defunct>postgres 6243 3889 0 12:25 ? 00:00:00 [postmaster] <defunct>postgres 6424 3889 0 12:26 ? 00:00:01 postgres: postgres international 10.19.0.49(45160) idlepostgres 6425 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>postgres 6429 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>postgres 6599 3889 0 12:27 ? 00:00:02 [postmaster] <defunct>postgres 6635 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6653 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6654 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6655 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6663 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6664 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6665 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6666 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6667 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6668 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6669 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6670 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6671 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6672 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6673 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6773 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6774 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6778 3889 0 12:28 ? 00:00:01 [postmaster] <defunct>postgres 6779 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6785 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6787 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6788 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6796 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6797 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6837 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6838 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6839 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6840 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6841 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6842 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6843 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6844 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6852 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6859 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6860 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6862 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6872 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6873 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6874 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6910 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6970 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6971 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6972 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6973 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6978 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6990 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6991 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6992 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6993 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7000 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7088 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7089 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7090 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7091 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7094 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7095 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7096 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7122 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7126 3889 0 12:29 ? 00:00:03 postgres: postgres mrs 10.19.0.51(7717) idlepostgres 7127 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>root 7152 7150 0 12:30 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7157 7155 0 12:30 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7168 7166 0 12:31 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7174 7172 0 12:31 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7188 7186 0 12:32 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7196 7194 0 12:32 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7258 7256 0 12:34 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7263 7261 0 12:34 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7308 7306 0 12:35 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7313 7311 0 12:35 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7325 7323 0 12:36 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7330 7328 0 12:36 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7371 7369 0 12:37 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrspostgres 22959 3889 0 11:09 ? 00:00:01 postgres: postgres spain 10.19.0.45(57042) idlepostgres 22961 3889 0 11:09 ? 00:00:22 postgres: postgres mirai 10.19.0.45(57043) idlepostgres 23225 3889 0 11:10 ? 00:00:00 postgres: postgres xmas 10.19.0.45(57061) idlepostgres 25844 3889 0 11:23 ? 00:00:01 postgres: postgres spain 10.19.0.51(11291) idlepostgres 31942 3889 0 11:53 ? 00:00:00 [postmaster] <defunct>postgres 31973 3889 0 11:53 ? 00:00:16 [postmaster] <defunct>
2) We did a strace to the postmaster pid. However we had 2 postmasters not dead
# ps -fea |grep -i postmaster
postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 7601 3889 0 12:37 ? 00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
As soon as we did a "strace" to the 3889 pid everything started to work again. Not sure it was a coincidence but that was the way it was.
# strace -p 3889Process 3889 attached - interrupt to quitselect(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be restarted)--- SIGUSR1 (User defined signal 1) @ 0 (0) ---rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7766rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)--- SIGCHLD (Child exited) @ 0 (0) ---rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7766wait4(-1, 0x7fffc37259a4, WNOHANG, NULL) = 0rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)--- SIGUSR1 (User defined signal 1) @ 0 (0) ---rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7767rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)--- SIGCHLD (Child exited) @ 0 (0) ---
I also straced the other postmaster pid
# strace -p 7601Process 7601 attached - interrupt to quitrecvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL, NULL) = 181brk(0x14ac4000) = 0x14ac4000lseek(3, 0, SEEK_END) = 16130048lseek(5, 0, SEEK_END) = 4661248lseek(3, 0, SEEK_END) = 16130048brk(0x14ae9000) = 0x14ae9000brk(0x14adb000) = 0x14adb000brk(0x14ad7000) = 0x14ad7000sendto(7, "\2\0\0\0\230\1\0\0H\203\314\0\4\0\0\0\5\0\0\0\0\0\0\0X\203\314\0\0\0\0\0"..., 408, 0, NULL, 0) = 408sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 74, 0, NULL, 0) = 74recvfrom(8, "P\0\0\0\213\0select id_key from transla"..., 8192, 0, NULL, NULL) = 175brk(0x14ac4000) = 0x14ac4000lseek(3, 0, SEEK_END) = 16130048lseek(5, 0, SEEK_END) = 4661248lseek(3, 0, SEEK_END) = 16130048brk(0x14ae9000) = 0x14ae9000brk(0x14adb000) = 0x14adb000brk(0x14ad7000) = 0x14ad7000sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 73, 0, NULL, 0) = 73recvfrom(8, <unfinished ...>Process 7601 detached
3) Checked the logs. We could see the same error lines we had before, but they dissapeared after just a few minutes.
2010-11-18 12:30:02.830 CET|1|mirai|10.19.0.43(4452)|3875|idle|4ce50a7c.f23|2010-11-18 12:14:04 CET|67/0|0|postgres| LOG: unexpected EOF on client connection2010-11-18 12:30:02.841 CET|1|mirai|10.19.0.43(13943)|2275|idle|4ce508b4.8e3|2010-11-18 12:06:28 CET|75/0|0|postgres| LOG: unexpected EOF on client connection2010-11-18 12:30:47.628 CET|2|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:31:48.056 CET|3|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:32:48.184 CET|4|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:33:48.614 CET|5|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:34:48.747 CET|6|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:35:49.277 CET|7|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:36:49.706 CET|8|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:37:30.462 CET|1|spain|10.19.0.48(11276)|7414|idle|4ce50ffa.1cf6|2010-11-18 12:37:30 CET|2/0|0|spainadmin| LOG: unexpected EOF on client connection
Any Ideas?
Thanks for your help and feedback!
Pablo
2010/11/16 Alvaro Herrera <alvherre@xxxxxxxxxxxxxxxxx>
Excerpts from Pablo Delgado Díaz-Pache's message of mar nov 16 12:18:09 -0300 2010:
> Not sure what you mean.Yeah. This means that the autovacuum launcher daemon is still running
>
> Once we start getting that error in the postgres log, there is no autovacuum
> entry. Only that error message once every minute (and as i wrote in my last
> email, the same message all the time).
and trying to schedule new jobs, but it's not able to for some reason.
When it tries to start a new one, it sees the old one still waiting to
start running; and then it logs this message.
To start a new job, what it does is send a signal to postmaster and wait
for it to create a child process. This is the part that's not working.
I have no idea what's going on. I have never seen this problem. If you
> The question is what is causing postgres to stop working. Autovacuum or
> another reason? If it's another reason ... what could it be?
see it happen again, please get a stacktrace from postmaster; and/or
strace it.
--
Álvaro Herrera <alvherre@xxxxxxxxxxxxxxxxx>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support