Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> · Fri, 16 Nov 2018 10:18:07 +0200



    On 15/11/18 4:48 μ.μ., Rui DeSousa
      wrote:

    
          On Nov 15, 2018, at 2:20 AM, Achilleas Mantzios
            <achill@xxxxxxxxxxxxxxxxxxxxx>
            wrote:
          

          On 14/11/18 6:24 μ.μ., Rui DeSousa
              wrote:

            
              On Nov 14, 2018, at 3:31
                AM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx>
                wrote:

                
                Our sysadms (seasoned linux/network guys : we have been
                working here for more than 10 yrs) were absolute in that
                we run no firewall or other traffic shaping system
                between the two hosts. (if we did the problem would
                manifest itself earlier).  Can you recommend what to
                look for exactly regarding both TCP stacks ? The
                subscriber node is a clone of the primary. We have :

                
                # sysctl -a | grep -i keepaliv

                net.ipv4.tcp_keepalive_intvl = 75

                net.ipv4.tcp_keepalive_probes = 9

                net.ipv4.tcp_keepalive_time = 7200

                
              Those keep alive settings are linux’s defaults and work
              out to be 18 hours before the abandon connection is
              dropped.  So, the WAL receiver should have corrected
              itself after that time.  For reference, I run terminating
              abandon session within 15 mins as they take-up valuable
              database resources and could potentially hold on to locks,
              snapshots, etc.

            
            Didn't happen. Friday 15:01 + '18
              hrs' = Saturday 09:01 . The subscription was stuck on
              Friday 15:01, unfortunately I didn't keep the whole output
              of the pg_stat_subscription (which I took on Sunday
              morning). And in Sunday morning the primary, having run
              out of space, PANIC'ed .

          
      Apparently TCP keep alive is not enabled; looks like
        we’re missing the following but don’t look for it :).  So the
        connection would stick around until it gets a reset packet which
        could be days if not longer especially if packets are being
        dropped.  I know you say there is no firewall; but the upstream
        server would send a reset packet if connection was terminated
        and/or Postgres was down — so there is something dropping
        packets.
      

      net.inet.tcp.always_keepalive=1
    
    
    This setting is from FreeBSD. I have tested changing the settings on
    my PostgreSQL 11.1 on my FreeBSD 11.2-RELEASE-p3, and this would
    have no effect at all to the postgresql settings, they remained all
    three of them at zero. This is completely irrelevant with my problem
    but anyway.

    
      A quick google and it looks like Linux defaults to
        not enabling keep alive whereas FreeBSD enables it by default
        and globally regardless of application request.  For Linux,
        Postgres will need to request it. You will need to setup the
        keep alive parameters in the Postgres configuration and restart
        the server.
    
    
    http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

    So according to the official Linux docs, three are the parameters
    that govern TCP keepalive in Linux, which in both the said systems
    are set as :

    root@TEST-smadb:/var/lib/pgsql# sysctl -a | grep keep

    net.ipv4.tcp_keepalive_intvl = 75

    net.ipv4.tcp_keepalive_probes = 9

    net.ipv4.tcp_keepalive_time = 7200

    root@TEST-smadb:/var/lib/pgsql# 

    
      The keep alive setup will allow WAL receiver to
        detect the broken connection resulting in it terminating the
        current connection and attempt to establish a new connection.
    
    
    So from looks of this, keep alive is enabled. (Also don't confuse
    WAL receiver with logical worker, different programs, albeit
    similar).

    
    People any other theories as to why the subscriber kept acting like
    the publisher was there, while it wasn't?

    Is there any way (by network means?) to mock this behavior in order
    to fool the replication worker like the sender is there?

    
    -- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt