# netstat -an | grep 3306tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTENtcp 0 0 10.19.0.47:3306 10.19.0.46:33691 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33745 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33710 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33743 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33751 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33690 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33758 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33725 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33739 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33730 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33744 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33688 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33741 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33757 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33712 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33734 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33719 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33738 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33682 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33702 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33701 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33742 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33708 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33703 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33731 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33723 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33695 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33724 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33728 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33689 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33729 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33694 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33720 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33754 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33732 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33705 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33748 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33749 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33733 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33753 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33755 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33726 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33727 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33737 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33693 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33700 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33740 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33747 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33716 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33709 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33697 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33717 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33736 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33715 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33704 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33711 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33746 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33718 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33692 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33707 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33696 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33750 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33681 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33699 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33756 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33714 SYN_RECVtcp 0 0 10.19.0.47:3306 10.19.0.46:33706 SYN_RECVtcp 1 0 10.19.0.47:3306 10.19.0.46:49437 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49436 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49439 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49438 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49433 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49432 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49435 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49434 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49429 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49428 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49431 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49430 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49427 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49453 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49452 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49455 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49454 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49449 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49448 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49451 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49450 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49445 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49444 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49447 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49446 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49441 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49440 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49443 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49442 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49457 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:49456 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30397 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30396 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30399 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30398 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30393 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30392 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30395 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30394 TIME_WAITtcp 100 0 10.19.0.47:3306 10.19.0.46:30391 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30409 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30408 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30411 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30410 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30405 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30404 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30407 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30406 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30401 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30400 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30403 TIME_WAITtcp 1 0 10.19.0.47:3306 10.19.0.46:30402 TIME_WAIT
Didn't do the netstat -an for port 5432, but I guess all max_connections for postgres were busy. Just a guess ...
All connections are done from internal servers, and there are no firewall rules in that interface. Just in case here it goes the iptables -L -n -v
# iptables -L -n -vChain INPUT (policy ACCEPT 0 packets, 0 bytes)pkts bytes target prot opt in out source destinationChain FORWARD (policy ACCEPT 0 packets, 0 bytes)pkts bytes target prot opt in out source destinationChain OUTPUT (policy ACCEPT 170M packets, 392G bytes)pkts bytes target prot opt in out source destinationChain mirai (1 references)pkts bytes target prot opt in out source destination30 1800 ACCEPT all -- eth0 * x.x.x.x 0.0.0.0/01023K 61M ACCEPT all -- eth1 * 10.0.0.0/8 0.0.0.0/0
2010/11/18 Pablo Delgado Díaz-Pache <delgadop@xxxxxxxxx>
Well, we had the problem again. This is what we did1) A "ps" to check what processes were running. Many "postmaster" processes in defunct state. A few postgres connections still working. A few "psql" (by shell script) queries hanged (they are scripts to monitor state of database. We use big-sister to do so).The normal state of the server is 1 postmaster pid and many "postgres" connections in "idle" state.This the result when we were having the problem. I've excluded irrelevant processes.postgres 892 3889 0 09:07 ? 00:01:05 postgres: postgres international 10.19.0.51(49264) idlepostgres 934 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>postgres 935 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>postgres 936 3889 0 12:00 ? 00:00:00 [postmaster] <defunct>postgres 938 3889 0 12:00 ? 00:00:01 [postmaster] <defunct>postgres 1244 3889 0 12:01 ? 00:00:01 [postmaster] <defunct>postgres 1453 3889 0 12:02 ? 00:00:02 [postmaster] <defunct>postgres 1563 3889 0 12:03 ? 00:00:10 [postmaster] <defunct>postgres 1862 3889 0 12:04 ? 00:00:04 [postmaster] <defunct>postgres 2275 3889 0 12:06 ? 00:00:02 [postmaster] <defunct>root 2287 49 0 Oct29 ? 00:00:00 [kondemand/0]root 2288 49 0 Oct29 ? 00:00:00 [kondemand/1]postgres 2485 3889 0 12:07 ? 00:00:05 postgres: postgres international 10.19.0.51(59066) idlepostgres 2490 3889 1 12:07 ? 00:00:26 postgres: postgres international 10.19.0.51(59070) idlepostgres 2568 3889 0 12:07 ? 00:00:00 postgres: postgres international 10.19.0.50(1249) idlepostgres 3544 3889 0 12:12 ? 00:00:07 [postmaster] <defunct>postgres 3875 3889 0 12:14 ? 00:00:03 [postmaster] <defunct>postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/datapostgres 3896 3889 0 Nov16 ? 00:00:00 postgres: logger processpostgres 3913 3889 0 Nov16 ? 00:00:05 postgres: writer processpostgres 3914 3889 0 Nov16 ? 00:00:00 postgres: wal writer processpostgres 3915 3889 0 Nov16 ? 00:00:01 postgres: autovacuum launcher processpostgres 3916 3889 0 Nov16 ? 00:01:42 postgres: stats collector processpostgres 4169 3889 0 12:15 ? 00:00:02 [postmaster] <defunct>postgres 4170 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>postgres 4171 3889 0 12:15 ? 00:00:07 [postmaster] <defunct>postgres 4281 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>postgres 4672 3889 0 12:17 ? 00:00:01 [postmaster] <defunct>postgres 5038 3889 0 12:19 ? 00:00:01 [postmaster] <defunct>postgres 5203 3889 0 12:20 ? 00:00:00 [postmaster] <defunct>postgres 5484 3889 0 12:21 ? 00:00:01 [postmaster] <defunct>postgres 5694 3889 0 12:22 ? 00:00:01 postgres: postgres international 10.19.0.49(29113) idlepostgres 5769 3889 0 12:23 ? 00:00:04 [postmaster] <defunct>postgres 5832 3889 0 12:23 ? 00:00:03 postgres: postgres international 10.19.0.49(29129) idlepostgres 5835 3889 0 12:23 ? 00:00:02 [postmaster] <defunct>postgres 5979 3889 0 12:24 ? 00:00:01 [postmaster] <defunct>postgres 6243 3889 0 12:25 ? 00:00:00 [postmaster] <defunct>postgres 6424 3889 0 12:26 ? 00:00:01 postgres: postgres international 10.19.0.49(45160) idlepostgres 6425 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>postgres 6429 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>postgres 6599 3889 0 12:27 ? 00:00:02 [postmaster] <defunct>postgres 6635 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6653 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6654 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6655 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6663 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6664 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6665 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6666 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6667 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6668 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6669 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6670 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6671 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6672 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6673 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>postgres 6773 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6774 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6778 3889 0 12:28 ? 00:00:01 [postmaster] <defunct>postgres 6779 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6785 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6787 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6788 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6796 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6797 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6837 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6838 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6839 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6840 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6841 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6842 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6843 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6844 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6852 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6859 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6860 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6862 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6872 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6873 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6874 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>postgres 6910 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6970 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6971 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6972 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6973 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6978 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6990 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6991 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6992 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 6993 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7000 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7088 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7089 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7090 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7091 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7094 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7095 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7096 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7122 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>postgres 7126 3889 0 12:29 ? 00:00:03 postgres: postgres mrs 10.19.0.51(7717) idlepostgres 7127 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>root 7152 7150 0 12:30 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7157 7155 0 12:30 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7168 7166 0 12:31 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7174 7172 0 12:31 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7188 7186 0 12:32 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7196 7194 0 12:32 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7258 7256 0 12:34 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7263 7261 0 12:34 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7308 7306 0 12:35 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7313 7311 0 12:35 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7325 7323 0 12:36 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7330 7328 0 12:36 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrsroot 7371 7369 0 12:37 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrspostgres 22959 3889 0 11:09 ? 00:00:01 postgres: postgres spain 10.19.0.45(57042) idlepostgres 22961 3889 0 11:09 ? 00:00:22 postgres: postgres mirai 10.19.0.45(57043) idlepostgres 23225 3889 0 11:10 ? 00:00:00 postgres: postgres xmas 10.19.0.45(57061) idlepostgres 25844 3889 0 11:23 ? 00:00:01 postgres: postgres spain 10.19.0.51(11291) idlepostgres 31942 3889 0 11:53 ? 00:00:00 [postmaster] <defunct>postgres 31973 3889 0 11:53 ? 00:00:16 [postmaster] <defunct>2) We did a strace to the postmaster pid. However we had 2 postmasters not dead# ps -fea |grep -i postmasterpostgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/datapostgres 7601 3889 0 12:37 ? 00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/dataAs soon as we did a "strace" to the 3889 pid everything started to work again. Not sure it was a coincidence but that was the way it was.# strace -p 3889Process 3889 attached - interrupt to quitselect(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be restarted)--- SIGUSR1 (User defined signal 1) @ 0 (0) ---rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7766rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)--- SIGCHLD (Child exited) @ 0 (0) ---rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7766wait4(-1, 0x7fffc37259a4, WNOHANG, NULL) = 0rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)--- SIGUSR1 (User defined signal 1) @ 0 (0) ---rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7767rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)--- SIGCHLD (Child exited) @ 0 (0) ---I also straced the other postmaster pid# strace -p 7601Process 7601 attached - interrupt to quitrecvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL, NULL) = 181brk(0x14ac4000) = 0x14ac4000lseek(3, 0, SEEK_END) = 16130048lseek(5, 0, SEEK_END) = 4661248lseek(3, 0, SEEK_END) = 16130048brk(0x14ae9000) = 0x14ae9000brk(0x14adb000) = 0x14adb000brk(0x14ad7000) = 0x14ad7000sendto(7, "\2\0\0\0\230\1\0\0H\203\314\0\4\0\0\0\5\0\0\0\0\0\0\0X\203\314\0\0\0\0\0"..., 408, 0, NULL, 0) = 408sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 74, 0, NULL, 0) = 74recvfrom(8, "P\0\0\0\213\0select id_key from transla"..., 8192, 0, NULL, NULL) = 175brk(0x14ac4000) = 0x14ac4000lseek(3, 0, SEEK_END) = 16130048lseek(5, 0, SEEK_END) = 4661248lseek(3, 0, SEEK_END) = 16130048brk(0x14ae9000) = 0x14ae9000brk(0x14adb000) = 0x14adb000brk(0x14ad7000) = 0x14ad7000sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 73, 0, NULL, 0) = 73recvfrom(8, <unfinished ...>Process 7601 detached3) Checked the logs. We could see the same error lines we had before, but they dissapeared after just a few minutes.2010-11-18 12:30:02.830 CET|1|mirai|10.19.0.43(4452)|3875|idle|4ce50a7c.f23|2010-11-18 12:14:04 CET|67/0|0|postgres| LOG: unexpected EOF on client connection2010-11-18 12:30:02.841 CET|1|mirai|10.19.0.43(13943)|2275|idle|4ce508b4.8e3|2010-11-18 12:06:28 CET|75/0|0|postgres| LOG: unexpected EOF on client connection2010-11-18 12:30:47.628 CET|2|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:31:48.056 CET|3|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:32:48.184 CET|4|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:33:48.614 CET|5|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:34:48.747 CET|6|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:35:49.277 CET|7|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:36:49.706 CET|8|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled2010-11-18 12:37:30.462 CET|1|spain|10.19.0.48(11276)|7414|idle|4ce50ffa.1cf6|2010-11-18 12:37:30 CET|2/0|0|spainadmin| LOG: unexpected EOF on client connectionAny Ideas?Thanks for your help and feedback!Pablo2010/11/16 Alvaro Herrera <alvherre@xxxxxxxxxxxxxxxxx>
Excerpts from Pablo Delgado Díaz-Pache's message of mar nov 16 12:18:09 -0300 2010:
> Not sure what you mean.Yeah. This means that the autovacuum launcher daemon is still running
>
> Once we start getting that error in the postgres log, there is no autovacuum
> entry. Only that error message once every minute (and as i wrote in my last
> email, the same message all the time).
and trying to schedule new jobs, but it's not able to for some reason.
When it tries to start a new one, it sees the old one still waiting to
start running; and then it logs this message.
To start a new job, what it does is send a signal to postmaster and wait
for it to create a child process. This is the part that's not working.
I have no idea what's going on. I have never seen this problem. If you
> The question is what is causing postgres to stop working. Autovacuum or
> another reason? If it's another reason ... what could it be?
see it happen again, please get a stacktrace from postmaster; and/or
strace it.
--
Álvaro Herrera <alvherre@xxxxxxxxxxxxxxxxx>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support