Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> · Wed, 14 Nov 2018 10:31:16 +0200



    On 14/11/18 3:52 π.μ., Rui DeSousa
      wrote:

    
          On Nov 13, 2018, at 11:06 AM, Achilleas Mantzios
            <achill@xxxxxxxxxxxxxxxxxxxxx>
            wrote:
          

              On 13/11/18 5:35 μ.μ., Rui
                DeSousa wrote:

              
                    On Nov 13, 2018, at 7:00 AM, Achilleas
                      Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx>
                      wrote:
                    

                      Is there a way
                        for the WAL receiver to not have detected the
                        termination of the replication stream?

                    
                The teardown of the network socket on the
                  upstream server should send a reset packet to the
                  downstream server and at that point the WAL receiver
                  would close its connection.  Is there any firewalls,
                  router, rules, etc between the nodes that could have
                  dropped the packet? 

                
              No

              
        If both nodes are up and able to communicate with each
          other then the OS should teardown the connection; why is that
          not happening?  You stated that you’re in the cloud; cloud
          providers use a software network and all endpoints have ACLs —
          a.k.a firewall. I would also check to see if iptables is
          running… it on by default.
        

        Have there been changes to how you deploy new ACLs, etc? 
        

        The fact that WAL sender terminated and WAL receiver did
          not; points to the WAL receiver not being able to communicate
          with the upstream server.
        

                    Shouldn't normally the
                      WAL receiver detect this and try again in
                      wal_retrieve_retry_interval ?
                  

                Not really… if the connection has already
                  been torn down; the upstream server would send another
                  reset packet on the next request and in this case it
                  would.  However, if request packets at not reaching
                  the upstream server; i.e. due to firewall silently
                  dropping the packets (personally I believe firewall
                  should always set reset packets to friendly hosts)
                  then what happens is the TCP/IP send queue builds up
                  with the requests packets instead — a t this point
                  waiting on the OS to terminate the connection which
                  can day or two depending on your TCP/IP setting.
                

              Again no dropping, no firewall.

              
        Again, if both nodes are up and are able to communicate
          then the this should get resolved on its own.
        

                What you want to use instead is
                  wal_receiver_timeout to detect the given case where
                  upstream server either no longer exists or the
                  firewall, etc is silently dropping packets.
              
              
              Once again from my original message :

              "while setting up logical replication since August we had
              seen early on the need to increase max_receiver_timeout
              and max_sender_timeout from 60sec to 5mins"

              
              So with wal_receiver_timeout='5 min', the receiver never
              detected any timeout.

              
        It should have reached the timeout and the connection
          should be torn down.  It could be that the send queue is
          backed up and Postgres is hung on trying to teardown the
          connection.  Again, network related issue.
        

        I’ve actually ran into this issue before; where I was
          unable to terminate an idle session from an application node
          that had an open transaction.  The TCP/IP send queue was
          backlogged thus it would not terminate; one might consider
          this a bug or not.  The application server was not reachable
          due to firewall dropping the connection and not sending a
          reset packet on further attempts to communicate with the
          application server; I had to get a network admin to drop the
          connection as bouncing Postgres was not an option on a
          production system and kill -9 is a bad idea too.  
        

      Assuming this is going to happen again — I would
        advise you to get from both nodes the state of connection and
        tcp/ip queue information from netstat.   If you can also to a
        tcpdump on the connection to see what each of the nodes is doing
        that would give you more insight.
      

      I would also advise looking into TCP/IP keep alive.
    
    
    Our sysadms (seasoned linux/network guys : we have been working here
    for more than 10 yrs) were absolute in that we run no firewall or
    other traffic shaping system between the two hosts. (if we did the
    problem would manifest itself earlier).  Can you recommend what to
    look for exactly regarding both TCP stacks ? The subscriber node is
    a clone of the primary. We have :

    
    # sysctl -a | grep -i keepaliv

    net.ipv4.tcp_keepalive_intvl = 75

    net.ipv4.tcp_keepalive_probes = 9

    net.ipv4.tcp_keepalive_time = 7200

    
    dynacom=# show tcp_keepalives_count ;

     tcp_keepalives_count 

    ----------------------

     0

    (1 row)

    
    dynacom=# show tcp_keepalives_idle ;

     tcp_keepalives_idle 

    ---------------------

     0

    (1 row)

    
    dynacom=# show tcp_keepalives_interval ;

     tcp_keepalives_interval 

    -------------------------

     0

    (1 row)

    
    Also in addition to what you say (netstat, tcpdump) if I detect such
    a case (even with the primary panic'ed -- yeah this would take quite
    some nerves to do this) I will connect with gdb and take a stack
    trace to know what the worker is doing and why doesn't it restart.

    
    -- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt