Search Postgresql Archives

Re: Fwd: libpq: indefinite block on poll during network problems

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Guys, first of all: thank you for you help and cooperation. I have received several mails suggesting tweaks for tcp_keepalive and usage of postgresql server functions/features (cancel, statement timeout), but as I said - it won't help.

I have reproduced the problem scenario. Logs are attached. I walk you through.

== Setup ==
Client and server applications are placed on separate hosts. Client = 192.168.15.4, Server = 192.168.15.7. Both are in local net. Both are synchronized using 3rd party NTP server. Lets look in strace_export.txt - top 8 lines = socket setup. Keepalive option is set. Client's OS keepalive parameters:

[root@krr2srv1wsn1 dtp_generator]# sysctl -a | grep keepalive
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 10

This means that after 10 seconds of idle connection first TCP Keep-Alive probe is sent. If 3 probes with 5 second interval fail - connection should be considered dead.

Server configuration is in postgresql.conf.

== Part 1. TCP Keep Alive ==
At 11:25:35.847138 connection to the server is made and the first query is sent. Got response fast at 11:25:35.858582. No other queries were made for the next minute to catch keep alive packets. Wireshark 1.8.2 marks 13 - 36 frames as Keep-Alive, so we can see that it's configured right and definitely works.

== Part 2. The Problem ==
At 11:26:40.933017 queries generation is started on client side. Client is configured to perform 1 request per second. After some arbitrary time next command is executed on server node:
[root@cluster1]# date && iptables -A OUTPUT -p tcp --sport 5432 -j DROP && iptables -A INPUT -p tcp --dport 5432 -j DROP

11:26:47 is outputed to console. As you can see in client trace file, this time corresponds to frame 55 - the last query is made. strace shows send && poll syscalls. And... that's it. The client got blocked on poll.

== Part 3. The aftermath ==
The Client was blocked ~2 minutes. I killed application with SIGTERM, which you can see in strace. At the time application was still waiting on libpq's poll. The Pcap file show no trace of keep-alive packets after server was isolated with iptable's rules. As I said earlier: TCP Keep-Alive is done on idle connection only. When TCP retransmission kicks-in - TCP Keep-Alive is not performed.


Let me repeat myself again: the problem is NOT with the server. The problem is with libpq's PGgetResult which ultimately leads to very optimistic poll routine.

Thank you.

With regards, Dmitry Samonenko.


11:25:35.843597 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 12
11:25:35.843629 setsockopt(12, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
11:25:35.843656 setsockopt(12, SOL_TCP, TCP_NODELAY, [1], 4) = 0
11:25:35.843691 fcntl64(12, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
11:25:35.843713 connect(12, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("192.168.15.7")}, 16) = -1 EINPROGRESS (Operation now in progress)
11:25:35.843901 poll([{fd=12, events=POLLOUT|POLLERR, revents=POLLOUT}], 1, 2000) = 1
11:25:35.846760 getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
11:25:35.846821 getsockname(12, {sa_family=AF_INET, sin_port=htons(50796), sin_addr=inet_addr("192.168.15.4")}, [16]) = 0
11:25:35.847138 send(12, "\0\0\0\10\4\322\26/", 8, 0) = 8
11:25:35.847285 poll([{fd=12, events=POLLIN|POLLERR, revents=POLLIN}], 1, 2000) = 1
11:25:35.848669 recv(12, "N", 1, 0)     = 1
11:25:35.848720 poll([{fd=12, events=POLLOUT|POLLERR, revents=POLLOUT}], 1, 2000) = 1
11:25:35.848799 send(12, "\0\0\0$\0\3\0\0user\0teligent\0database\0mnp\0\0", 36, 0) = 36
11:25:35.848982 poll([{fd=12, events=POLLIN|POLLERR, revents=POLLIN}], 1, 2000) = 1
11:25:35.851547 recv(12, "R\0\0\0\10\0\0\0\0S\0\0\0\26application_name\0\0S\0\0\0\31client_encoding\0UTF8\0S\0\0\0\27DateStyle\0ISO, MDY\0S\0\0\0\31integer_datetimes\0on\0S\0\0\0\33IntervalStyle\0postgres\0S\0\0\0\24is_superuser\0on\0S\0\0\0\31server_encoding\0UTF8\0S\0\0\0\31server_version\0009.2.4\0S\0\0\0#session_authorization\0teligent\0S\0\0\0#standard_conforming_strings\0on\0S\0\0\0\22TimeZone\0W-SU\0K\0\0\0\f\0\0w\30i\222\303\265Z\0\0\0\5I", 16384, 0) = 319
11:25:35.851989 send(12, "Q\0\0\0;select route, edr_tag from p90_get_route(\'9282487100\')\0", 60, 0) = 60
11:25:35.852145 poll([{fd=12, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
11:25:35.858582 recv(12, "T\0\0\0008\0\2route\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0edr_tag\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0D\0\0\0\27\0\2\0\0\0\5D2502\0\0\0\4P-N0C\0\0\0\rSELECT 1\0Z\0\0\0\5I", 16384, 0) = 101
11:26:40.933017 send(12, "Q\0\0\0;select route, edr_tag from p90_get_route(\'9282487100\')\0", 60, 0) = 60
11:26:40.934888 poll([{fd=12, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
11:26:40.936551 recv(12, "T\0\0\0008\0\2route\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0edr_tag\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0D\0\0\0\27\0\2\0\0\0\5D2502\0\0\0\4P-N0C\0\0\0\rSELECT 1\0Z\0\0\0\5I", 16384, 0) = 101
11:26:41.996206 send(12, "Q\0\0\0;select route, edr_tag from p90_get_route(\'9282487101\')\0", 60, 0) = 60
11:26:41.996357 poll([{fd=12, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
11:26:41.998455 recv(12, "T\0\0\0008\0\2route\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0edr_tag\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0D\0\0\0\27\0\2\0\0\0\5D2502\0\0\0\4P-N0C\0\0\0\rSELECT 1\0Z\0\0\0\5I", 16384, 0) = 101
11:26:43.037868 send(12, "Q\0\0\0;select route, edr_tag from p90_get_route(\'9282487102\')\0", 60, 0) = 60
11:26:43.038053 poll([{fd=12, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
11:26:43.040409 recv(12, "T\0\0\0008\0\2route\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0edr_tag\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0D\0\0\0\27\0\2\0\0\0\5D2502\0\0\0\4P-N0C\0\0\0\rSELECT 1\0Z\0\0\0\5I", 16384, 0) = 101
11:26:44.196052 send(12, "Q\0\0\0;select route, edr_tag from p90_get_route(\'9282487103\')\0", 60, 0) = 60
11:26:44.196267 poll([{fd=12, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
11:26:44.200684 recv(12, "T\0\0\0008\0\2route\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0edr_tag\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0D\0\0\0\27\0\2\0\0\0\5D2502\0\0\0\4P-N0C\0\0\0\rSELECT 1\0Z\0\0\0\5I", 16384, 0) = 101
11:26:45.248926 send(12, "Q\0\0\0;select route, edr_tag from p90_get_route(\'9282487104\')\0", 60, 0) = 60
11:26:45.249098 poll([{fd=12, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
11:26:45.263638 recv(12, "T\0\0\0008\0\2route\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0edr_tag\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0D\0\0\0\27\0\2\0\0\0\5D2502\0\0\0\4P-N0C\0\0\0\rSELECT 1\0Z\0\0\0\5I", 16384, 0) = 101
11:26:46.289267 send(12, "Q\0\0\0;select route, edr_tag from p90_get_route(\'9282487105\')\0", 60, 0) = 60
11:26:46.289489 poll([{fd=12, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
11:26:46.291821 recv(12, "T\0\0\0008\0\2route\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0edr_tag\0\0\0\0\0\0\0\0\0\4\23\377\377\377\377\377\377\0\0D\0\0\0\27\0\2\0\0\0\5D2502\0\0\0\4P-N0C\0\0\0\rSELECT 1\0Z\0\0\0\5I", 16384, 0) = 101
11:26:47.510810 send(12, "Q\0\0\0;select route, edr_tag from p90_get_route(\'9282487106\')\0", 60, 0) = 60
11:26:47.510990 poll([{fd=12, events=POLLIN|POLLERR}], 1, -1) = -1 EINTR (Interrupted system call)
11:28:41.326314 +++ killed by SIGTERM +++

Attachment: client.pcap
Description: application/extension-pcap

Attachment: server.pcap
Description: application/extension-pcap

Attachment: postgresql.conf
Description: Binary data


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux