Hi Alvaro!
On 23/11/18 1:10 μ.μ., Alvaro Herrera wrote:
On 2018-Nov-14, Rui DeSousa wrote:
On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> wrote:
Our sysadmins (seasoned Linux/network guys; we have been working here
for more than 10 years) were adamant that we run no firewall or
other traffic-shaping system between the two hosts (if we did, the
problem would have manifested itself earlier). Can you recommend what
exactly to look for regarding both TCP stacks? The subscriber node is
a clone of the primary. We have:
# sysctl -a | grep -i keepaliv
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
Those keepalive settings are Linux’s defaults and work out to be 18
hours before the abandoned connection is dropped. So the WAL receiver
should have corrected itself after that time. For reference, I
terminate abandoned sessions within 15 minutes, as they take up valuable
database resources and could potentially hold on to locks, snapshots,
etc.
Where does your 18h figure come from? As I understand it, these numbers
mean "wait 7200 seconds, then send 9 probes 75 seconds apart, and kill
the connection if no reply is obtained". So that works out to about 131
minutes (modulo fencepost bug). Certainly not 18 hours ...
Thanks, yes, it sums up to 2 hrs 11 mins. In the moments after the primary crashed I didn't have the nerves/patience/guts to wait that long and actually prove that the subscriber was happily listening to a
ghost/stuck connection.
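For the record, the arithmetic behind that figure can be checked with a small Python sketch. It assumes the simple model described above (idle time, then a fixed number of probes spaced one interval apart), so the result is approximate, give or take one interval. The function names and the "tuned" values (600/60/5) are only illustrative assumptions, not settings anyone in this thread actually runs.

```python
def keepalive_timeout(idle_s: int, interval_s: int, probes: int) -> int:
    """Approximate seconds from last activity until a silent peer is dropped.

    Simple model: wait idle_s, then send `probes` probes interval_s apart.
    """
    return idle_s + interval_s * probes


def as_hms(seconds: int) -> str:
    """Format a number of seconds as 'Hh Mm Ss'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


# Values from the sysctl output quoted above:
# tcp_keepalive_time = 7200, tcp_keepalive_intvl = 75, tcp_keepalive_probes = 9
linux_defaults = keepalive_timeout(7200, 75, 9)
print(linux_defaults, as_hms(linux_defaults))  # 7875 2h 11m 15s

# Hypothetical values that would give roughly a 15-minute cutoff
# (illustrative only, not taken from the thread).
tuned = keepalive_timeout(600, 60, 5)
print(tuned, as_hms(tuned))  # 900 0h 15m 0s
```

So with the defaults a dead peer is noticed after roughly 131 minutes, and getting anywhere near 15 minutes requires lowering all three values considerably.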
Now ... I have seen Linux kernel code that seemed to me to cause network
transmission to get stuck *in the kernel* without any way out. Now I'm not
a kernel expert and I don't know if this applies to your case (maybe it
got fixed already), but it was definitely some process that was stuck
with "wchan" set to a network kernel call, and way beyond TCP keepalives.
It seems we'll have to upgrade our systems/kernels ASAP. Thanks a lot!
--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt