On 17/11/18 10:47 μ.μ., Rui DeSousa
wrote:
You may read
the PostgreSQL backend sources (grep for
SO_KEEPALIVE), the code supports KEEPALIVE.
Postgres supports it; but
the question is it on for the given connection?
As we both saw this is the default.
I checked on
a bare minimal default installation, (after tweaking
the kernel tunables to smaller values of course),
keepalive msgs are sent and ACK'ed at the specified
intervals, checked with wireshark, port 5432. You
should test this yourself.
I just configured
Postgres with streaming replication using the following
versions and TCP keep alive was enabled by default for the WAL
receiver connection and also psql connections.
Linux debian
4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13)
x86_64 GNU/Linux
PostgreSQL 10.6
(Debian 10.6-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by
gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
root@debian:~#
netstat -anp --timers | grep -e Timer -e EST | grep -e
Timer -e 5432
Proto Recv-Q
Send-Q Local Address Foreign Address State
PID/Program name Timer
tcp 0
0 10.6.6.101:47546 10.6.6.100:5432
ESTABLISHED 989/telnet off (0.00/0/0)
tcp 0
0 10.6.6.101:47544 10.6.6.100:5432
ESTABLISHED 953/psql keepalive (7103.36/0/0)
tcp 0
0 10.6.6.101:47542 10.6.6.100:5432
ESTABLISHED 922/postgres: 10/ma keepalive (7088.03/0/0)
As you can see
from above; telnet does not enable keep alive on the
connection. I would check the troubled system with the above
netstat command to verify that keep alive is in fact enabled
on the WAL receiver connection.
On the very same system (with *no* configuration/nothing
changed -the kernel sysctl keepalive tested settings was on
another test machine-), just now :
root@TEST-smadb:/var/lib/pgsql# uname -a
Linux TEST-smadb 3.16.0-4-amd64 #1 SMP Debian
3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
root@TEST-smadb:/var/lib/pgsql#
root@TEST-smadb:/var/lib/pgsql# netstat -anp --timers |
grep 10.9.0.20 | grep 5432
tcp 0 0 10.9.0.77:51307
10.9.0.20:5432 ESTABLISHED 34400/postgres: bgw
keepalive (5711.60/0/0)
root@TEST-smadb:/var/lib/pgsql#
root@TEST-smadb:/var/lib/pgsql# ps aux | grep 34400
postgres 34400 0.5 17.1 19618680 5648984 ? Ssl
Nov16 13:52 postgres: bgworker: logical replication worker for
subscription 185231525
root 70384 0.0 0.0 12708 1620 pts/1 S+
08:29 0:00 grep 34400
root@TEST-smadb:/var/lib/pgsql#
root@TEST-smadb:/var/lib/pgsql# su - postgres -c 'psql -c
"select version()"'
version
-----------------------------------------------------------------------------------------
PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc
(Debian 4.9.2-10) 4.9.2, 64-bit
(1 row)
root@TEST-smadb:/var/lib/pgsql#
If it’s enabled
the connection should have terminated after the 18 hours
Yes.
and hopefully less
now with your new setting.
This was on a test node, the settings are still at the default.
I have no idea why
it wouldn’t terminate and reconnect other than tcp keep live
is either off or a bug in Linux/Postgres.
I may post the logs of both the primary and the subscriber
(filtering out some sensitive business content of course). It is
striking that the subscriber node immediately after restart,
starts asking for WALs, so this is not network firewall related,
at least not in a static fashion. I mean if a firewall was in
between with some rules then it could have (catastrophic) effect
for the whole course of replication, not just those specific
hours. So I too think that there maybe a bug in PostgreSQL (but
the logical worker shares code with the wal receiver, this has
been tested for so long) or the OS (again unlikely) *or* something
weird may have happened, like some VM migration, e.g. cloning, on
the cloud provider side. I don't have direct contact with them,
our admin guys say nothing abnormal happened, the cloud provider
says nothing abnormal happened, however my boss told me that they
were doing some migrations from one site to another. That might
explain some things.
I will ask again. Thank you a lot!