Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> · Sat, 17 Nov 2018 13:07:18 +0200



    On 16/11/18 5:29 PM, Rui DeSousa wrote:

    
          On Nov 16, 2018, at 3:18 AM, Achilleas Mantzios
            <achill@xxxxxxxxxxxxxxxxxxxxx>
            wrote:
          

                net.inet.tcp.always_keepalive=1
              
              
              This setting is from FreeBSD. I tested changing it on my
              PostgreSQL 11.1 on FreeBSD 11.2-RELEASE-p3, and it had no
              effect at all on the PostgreSQL settings; all three of
              them remained at zero. This is completely irrelevant to
              my problem, but anyway.

              
        That is what I stated; you don’t need it.  The point is that in
          Linux the application has to enable it, and I don’t know of a
          kernel setting for Linux like the one in FreeBSD.
      
    
    You can read the PostgreSQL backend sources (grep for
      SO_KEEPALIVE); the code supports KEEPALIVE.
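
    To make "the application has to enable it" concrete, here is a
    minimal sketch in Python rather than the backend's C. It is not
    PostgreSQL code: the helper name enable_keepalive and the timing
    values are illustrative assumptions. It shows the per-socket
    SO_KEEPALIVE switch and, where the platform exposes them, the
    per-socket overrides of the kernel-wide defaults.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive for one socket (hypothetical helper).

    idle, interval and count are illustrative values, not recommendations.
    """
    # SO_KEEPALIVE is the per-socket switch the application itself must set.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The per-socket timing options are not defined on every platform,
    # so each one is applied only if the socket module exposes it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    # A non-zero value means keepalive is switched on for this socket.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

    Any timing value left unset on the socket falls back to the kernel
    tunables, which is why the sysctl values alone say nothing about
    whether a given connection has keepalive switched on.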

    
                A quick google suggests that Linux
                  defaults to not enabling keep alive, whereas FreeBSD
                  enables it by default and globally regardless of
                  application request.  For Linux, Postgres will need to
                  request it. You will need to set up the keep alive
                  parameters in the Postgres configuration and restart
                  the server.
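
                On the client side of a connection, libpq accepts the
                  keepalives, keepalives_idle, keepalives_interval and
                  keepalives_count connection parameters. A minimal
                  sketch of assembling such a connection string follows;
                  the helper name, host, database name and values are
                  hypothetical, and whether this is the right place to
                  configure keepalive for a given setup is an assumption
                  to verify.

```python
def keepalive_conninfo(base, idle=60, interval=10, count=5):
    """Append libpq's client-side keepalive parameters to a conninfo string.

    Hypothetical helper; the default values are illustrative only.
    """
    params = {
        "keepalives": 1,                  # enable TCP keepalive on this connection
        "keepalives_idle": idle,          # idle seconds before the first probe
        "keepalives_interval": interval,  # seconds between unanswered probes
        "keepalives_count": count,        # unanswered probes before giving up
    }
    extra = " ".join(f"{key}={value}" for key, value in params.items())
    return f"{base} {extra}"


if __name__ == "__main__":
    # Host and database name are placeholders.
    print(keepalive_conninfo("host=publisher.example dbname=app"))
```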
              
              
              http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

              So according to the official Linux docs, three
              parameters govern TCP keepalive in Linux, and on both of
              the systems in question they are set as follows:

              root@TEST-smadb:/var/lib/pgsql# sysctl -a | grep keep
              net.ipv4.tcp_keepalive_intvl = 75
              net.ipv4.tcp_keepalive_probes = 9
              net.ipv4.tcp_keepalive_time = 7200
              root@TEST-smadb:/var/lib/pgsql#

              
        That does not mean the connection has TCP keep alive
          enabled; it just means that if an application requests it,
          those would be the default settings if it doesn’t provide its
          own.  Those settings would be too large anyway; you want to be
          able to detect a broken connection much more quickly than the
          two hours and more (7200 s idle plus 9 probes at 75 s) that
          they imply.
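
        As a rough check of what the sysctl values quoted above add up
          to: the kernel waits tcp_keepalive_time seconds of idleness,
          then sends up to tcp_keepalive_probes probes spaced
          tcp_keepalive_intvl seconds apart before dropping the
          connection. A minimal sketch of that arithmetic (the function
          name is illustrative):

```python
def keepalive_detection_seconds(idle, interval, probes):
    """Approximate seconds before an idle, silently dead peer is dropped.

    The idle wait before the first probe, plus one interval per
    unanswered probe.
    """
    return idle + interval * probes


if __name__ == "__main__":
    # Values taken from the sysctl output quoted in this thread.
    total = keepalive_detection_seconds(idle=7200, interval=75, probes=9)
    print(total, "seconds")
```

        With 7200, 75 and 9 this comes to 7875 seconds, a little over
          two hours.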
      
    
    I checked on a bare-minimum default installation (after tweaking
      the kernel tunables to smaller values, of course): keepalive
      messages are sent and ACKed at the specified intervals, as
      verified with Wireshark on port 5432. You should test this
      yourself.

    
                The keep alive setup will allow the WAL
                  receiver to detect the broken connection, so that it
                  terminates the current connection and attempts to
                  establish a new one.
              
              
              So from the looks of this, keep alive is enabled. (Also,
              don't confuse the WAL receiver with the logical worker;
              they are different programs, albeit similar.)

            
      I don’t believe it’s enabled; have you checked to see whether you
        are getting keep alive packets?  If it were enabled, the
        connection would have been terminated after a little over two
        hours (7200 s idle plus 9 probes at 75 s).
    
    
    See above. In the meantime, it would be nice if someone from the
      hackers would chime in to clear things up, just to be sure.
    Since PostgreSQL *supports* KEEPALIVE and the logical worker
      carried on happily as if nothing had happened, I guess
      *something* was mocking the KEEPALIVE ACKs?

    
        Is there any way
          (by network means?) to mock this behavior in order to fool the
          replication worker into thinking the sender is still there?

        
      Put a firewall in between the servers and drop the
        packets without sending resets.
      

      Have a read here:
      

      Section 4.2
      

        http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/
      
      
      The RFC states TCP keep alive should be off by
        default; FreeBSD changed that back in 1999 and I believe Linux
        still follows the RFC:
      

      https://serverfault.com/questions/671710/why-does-freebsd-net-inet-tcp-always-keepalive-violate-rfc1122#671749