On 2018-Nov-14, Rui DeSousa wrote:

> > On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Our sysadms (seasoned linux/network guys : we have been working here
> > for more than 10 yrs) were absolute in that we run no firewall or
> > other traffic shaping system between the two hosts. (if we did the
> > problem would manifest itself earlier). Can you recommend what to
> > look for exactly regarding both TCP stacks ? The subscriber node is
> > a clone of the primary. We have :
> >
> > # sysctl -a | grep -i keepaliv
> > net.ipv4.tcp_keepalive_intvl = 75
> > net.ipv4.tcp_keepalive_probes = 9
> > net.ipv4.tcp_keepalive_time = 7200
>
> Those keep alive settings are linux’s defaults and work out to be 18
> hours before the abandon connection is dropped. So, the WAL receiver
> should have corrected itself after that time. For reference, I run
> terminating abandon session within 15 mins as they take-up valuable
> database resources and could potentially hold on to locks, snapshots,
> etc.

Where does your 18h figure come from? As I understand it, these numbers
mean "wait 7200 seconds, then send 9 probes 75 seconds apart", and kill
the connection if no reply is obtained. So that works out to about 131
minutes (modulo a fencepost bug). Certainly not 18 hours ...

Now ... I have seen Linux kernel code that seemed to me to cause a
network transmission to get stuck *in the kernel* without any way out.
I'm not a kernel expert and I don't know whether this applies to your
case (maybe it got fixed already), but it was definitely some process
that was stuck with "wchan" set to a network kernel call, and way beyond
TCP keepalives.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
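For anyone who wants to check the keepalive arithmetic above, here is a
minimal sketch in Python. It assumes the simple model described in the
message (idle time, then N unanswered probes spaced one interval apart);
the function name is made up for illustration, and the exact kernel
behaviour may differ by one interval either way.

```python
# Keepalive values quoted from the sysctl output in this thread.
TCP_KEEPALIVE_TIME = 7200    # seconds of idle time before the first probe
TCP_KEEPALIVE_INTVL = 75     # seconds between unanswered probes
TCP_KEEPALIVE_PROBES = 9     # unanswered probes before the connection is dropped


def keepalive_timeout_seconds(idle, interval, probes):
    """Approximate time until an idle, dead connection is dropped.

    Hypothetical helper: assumes idle time plus one interval per probe,
    ignoring any off-by-one in how the kernel counts the last probe.
    """
    return idle + probes * interval


total = keepalive_timeout_seconds(
    TCP_KEEPALIVE_TIME, TCP_KEEPALIVE_INTVL, TCP_KEEPALIVE_PROBES
)
print(f"{total} s = {total / 60:.2f} min")   # prints "7875 s = 131.25 min"
```

Under that model the default settings give a bit over two hours, which
matches the "about 131 minutes" figure.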