On 15/11/18 4:48 μ.μ., Rui DeSousa
wrote:
On 14/11/18 6:24 μ.μ., Rui DeSousa
wrote:
On Nov 14, 2018, at 3:31
AM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx>
wrote:
Our sysadms (seasoned linux/network guys : we have been
working here for more than 10 yrs) were absolute in that
we run no firewall or other traffic shaping system
between the two hosts. (if we did the problem would
manifest itself earlier). Can you recommend what to
look for exactly regarding both TCP stacks ? The
subscriber node is a clone of the primary. We have :
# sysctl -a | grep -i keepaliv
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
Those keep alive settings are linux’s defaults and work
out to be 18 hours before the abandon connection is
dropped. So, the WAL receiver should have corrected
itself after that time. For reference, I run terminating
abandon session within 15 mins as they take-up valuable
database resources and could potentially hold on to locks,
snapshots, etc.
Didn't happen. Friday 15:01 + '18
hrs' = Saturday 09:01 . The subscription was stuck on
Friday 15:01, unfortunately I didn't keep the whole output
of the pg_stat_subscription (which I took on Sunday
morning). And in Sunday morning the primary, having run
out of space, PANIC'ed .
Apparently TCP keep alive is not enabled; looks like
we’re missing the following but don’t look for it :). So the
connection would stick around until it gets a reset packet which
could be days if not longer especially if packets are being
dropped. I know you say there is no firewall; but the upstream
server would send a reset packet if connection was terminated
and/or Postgres was down — so there is something dropping
packets.
net.inet.tcp.always_keepalive=1
This setting is from FreeBSD. I have tested changing the settings on
my PostgreSQL 11.1 on my FreeBSD 11.2-RELEASE-p3, and this would
have no effect at all to the postgresql settings, they remained all
three of them at zero. This is completely irrelevant with my problem
but anyway.
A quick google and it looks like Linux defaults to
not enabling keep alive whereas FreeBSD enables it by default
and globally regardless of application request. For Linux,
Postgres will need to request it. You will need to setup the
keep alive parameters in the Postgres configuration and restart
the server.
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
So according to the official Linux docs, three are the parameters
that govern TCP keepalive in Linux, which in both the said systems
are set as :
root@TEST-smadb:/var/lib/pgsql# sysctl -a | grep keep
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
root@TEST-smadb:/var/lib/pgsql#
The keep alive setup will allow WAL receiver to
detect the broken connection resulting in it terminating the
current connection and attempt to establish a new connection.
So from looks of this, keep alive is enabled. (Also don't confuse
WAL receiver with logical worker, different programs, albeit
similar).
People any other theories as to why the subscriber kept acting like
the publisher was there, while it wasn't?
Is there any way (by network means?) to mock this behavior in order
to fool the replication worker like the sender is there?
--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
|