On 16/11/18 5:29 PM, Rui DeSousa wrote:
net.inet.tcp.always_keepalive=1
This setting is from FreeBSD. I tested changing it on my
PostgreSQL 11.1 on FreeBSD 11.2-RELEASE-p3, and it had no
effect at all on the PostgreSQL settings; all three of them
remained at zero. This is completely irrelevant to my problem,
but anyway.
That is what I stated; you don’t need it. In Linux the
application has to enable it, and I don’t know of a kernel
setting for Linux like the one in FreeBSD.
You may read the PostgreSQL backend sources (grep for
SO_KEEPALIVE), the code supports KEEPALIVE.
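To illustrate what "the application has to enable it" means at the socket level, here is a minimal Python sketch. It is not PostgreSQL's code; the helper names and the idle/interval/count values are assumptions chosen for illustration only.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Ask the kernel to send TCP keepalive probes on this socket."""
    # The application must request keepalive explicitly on the socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the kernel defaults. These constants are
    # platform-specific, so only set the ones this platform exposes.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


def keepalive_enabled(sock):
    """Return True if SO_KEEPALIVE is set on the socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive enabled:", keepalive_enabled(s))
    s.close()
```

If the per-socket values are not set, the kernel-wide defaults apply to any socket that has requested keepalive.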
A quick Google search suggests that Linux
defaults to not enabling keep alive, whereas FreeBSD
enables it by default and globally, regardless of
application request. For Linux, Postgres will need to
request it. You will need to set up the keep alive
parameters in the Postgres configuration and restart
the server.
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
So according to the official Linux docs, three parameters
govern TCP keepalive in Linux, and on both of the systems in
question they are set as follows:
root@TEST-smadb:/var/lib/pgsql# sysctl -a | grep keep
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
root@TEST-smadb:/var/lib/pgsql#
That does not mean the connection has TCP keep alive
enabled; it just means that if an application requests it,
those would be the default settings if it doesn’t provide its
own. Those settings would be too large anyway; you want to be
able to detect a broken connection much quicker than the
roughly two hours (7200 s idle plus 9 probes at 75 s) that
those defaults imply.
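As a rough worked example of the arithmetic, assuming the usual Linux semantics (the first probe is sent after tcp_keepalive_time seconds of idleness, then one probe every tcp_keepalive_intvl seconds until tcp_keepalive_probes have gone unanswered), the worst-case detection time can be sketched as below. The function name is made up for illustration.

```python
def worst_case_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Idle time before the first probe plus the time spent on unanswered probes."""
    return keepalive_time + keepalive_intvl * keepalive_probes


# Kernel defaults quoted in the sysctl output above.
defaults = worst_case_detection_seconds(7200, 75, 9)

# Hypothetical smaller values, purely for comparison.
tuned = worst_case_detection_seconds(60, 10, 5)

print("defaults:", defaults, "seconds")
print("tuned:   ", tuned, "seconds")
```

With the defaults this comes to 7875 seconds, a little over two hours; the smaller hypothetical values bring it down to 110 seconds.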
I checked on a bare-minimum default installation (after tweaking
the kernel tunables to smaller values, of course): keepalive
messages are sent and ACKed at the specified intervals, as
verified with Wireshark on port 5432. You should test this yourself.
The keep alive setup will allow the WAL
receiver to detect the broken connection, so that
it terminates the current connection and attempts to
establish a new connection.
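On the client side of a replication connection, libpq accepts the keepalives, keepalives_idle, keepalives_interval and keepalives_count connection parameters. A minimal sketch of building such a connection string follows; the helper name, host, database, user and timing values are all hypothetical, and the sketch assumes no value contains spaces or quotes.

```python
def build_conninfo(host, dbname, user, idle=60, interval=10, count=5):
    """Build a libpq-style keyword=value connection string with keepalive settings."""
    params = {
        "host": host,
        "dbname": dbname,
        "user": user,
        "keepalives": 1,                  # request TCP keepalive on the client socket
        "keepalives_idle": idle,          # seconds of idleness before the first probe
        "keepalives_interval": interval,  # seconds between probes
        "keepalives_count": count,        # unanswered probes before the connection is dropped
    }
    # Naive join: assumes no value needs quoting.
    return " ".join(f"{key}={value}" for key, value in params.items())


if __name__ == "__main__":
    print(build_conninfo("primary.example.com", "postgres", "replicator"))
```

A string like this could be used wherever a libpq connection string is expected, for example as the connection string of a standby or subscription; whether that suits a given setup would need to be tested.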
So from the looks of this, keep alive is enabled. (Also, don't
confuse the WAL receiver with the logical worker; they are
different programs, albeit similar.)
I don’t believe it’s enabled; have you checked to see that you
are getting keep alive packets? If it were enabled, it would have
terminated after roughly two hours.
See above. In the meantime, it would be nice if someone from the
hackers would chime in to clear things up, just to be sure.
Since PostgreSQL *supports* KEEPALIVE and the logical worker
stayed happy as if nothing had happened, I guess
*something* was mocking the KEEPALIVE ACKs?
Is there any way
(by network means?) to mock this behavior in order to fool the
replication worker into thinking the sender is there?
Put a firewall in between the servers and drop the
packets without sending resets.
Have a read here:
Section 4.2
The RFC states TCP keep alive should be off by
default; FreeBSD changed that back in 1999 and I believe Linux
still follows the RFC: