Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> · Sun, 18 Nov 2018 09:03:05 +0200



    On 17/11/18 10:47 μ.μ., Rui DeSousa
      wrote:

    
          On Nov 17,
              2018, at 6:07 AM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx>
              wrote:
          

              You may read
                  the PostgreSQL backend sources (grep for
                  SO_KEEPALIVE), the code supports KEEPALIVE.
              

        Postgres supports it; but
            the question is it on for the given connection? 

          
    As we both saw this is the default.

    
              I checked on
                  a bare minimal default installation, (after tweaking
                  the kernel tunables to smaller values of course),
                  keepalive msgs are sent and ACK'ed at the specified
                  intervals, checked with wireshark, port 5432. You
                  should test this yourself.

                
      I just configured
          Postgres with streaming replication using the following
          versions and TCP keep alive was enabled by default for the WAL
          receiver connection and also psql connections.
      

        Linux debian
            4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13)
            x86_64 GNU/Linux
        PostgreSQL 10.6
            (Debian 10.6-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by
            gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
      
      
        root@debian:~#
            netstat -anp --timers | grep -e Timer -e  EST | grep -e
            Timer -e 5432
        Proto Recv-Q
            Send-Q Local Address           Foreign Address         State
                  PID/Program name     Timer
        tcp        0    
             0 10.6.6.101:47546        10.6.6.100:5432        
            ESTABLISHED 989/telnet           off (0.00/0/0)
        tcp        0    
             0 10.6.6.101:47544        10.6.6.100:5432        
            ESTABLISHED 953/psql             keepalive (7103.36/0/0)
        tcp        0    
             0 10.6.6.101:47542        10.6.6.100:5432        
            ESTABLISHED 922/postgres: 10/ma  keepalive (7088.03/0/0)
      
      
      As you can see
          from above; telnet does not enable keep alive on the
          connection.  I would check the troubled system with the above
          netstat command to verify that keep alive is in fact enabled
          on the WAL receiver connection.
      

    On the very same system (with *no* configuration/nothing
        changed -the kernel sysctl keepalive tested settings was on
        another test machine-), just now :
    root@TEST-smadb:/var/lib/pgsql# uname -a

      Linux TEST-smadb 3.16.0-4-amd64 #1 SMP Debian
        3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux

      root@TEST-smadb:/var/lib/pgsql# 

      root@TEST-smadb:/var/lib/pgsql# netstat -anp --timers |
        grep 10.9.0.20 | grep 5432

      tcp        0      0 10.9.0.77:51307        
        10.9.0.20:5432          ESTABLISHED 34400/postgres: bgw
        keepalive (5711.60/0/0)

      root@TEST-smadb:/var/lib/pgsql# 

      root@TEST-smadb:/var/lib/pgsql# ps aux | grep 34400

      postgres  34400  0.5 17.1 19618680 5648984 ?    Ssl 
        Nov16  13:52 postgres: bgworker: logical replication worker for
        subscription 185231525  

      root      70384  0.0  0.0  12708  1620 pts/1    S+  
        08:29   0:00 grep 34400

      root@TEST-smadb:/var/lib/pgsql# 

      root@TEST-smadb:/var/lib/pgsql# su - postgres -c 'psql -c
        "select version()"'

                                              
        version                                         

      -----------------------------------------------------------------------------------------

       PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc
        (Debian 4.9.2-10) 4.9.2, 64-bit

      (1 row)

      
      root@TEST-smadb:/var/lib/pgsql#

      
      If it’s enabled
          the connection should have terminated after the 18 hours

        
    Yes.

    
      and hopefully less
          now with your new setting.  

        
    This was on a  test node, the settings are still at the default.

    
      I have no idea why
          it wouldn’t terminate and reconnect other than tcp keep live
          is either off or a bug in Linux/Postgres.
    
    
    I may post the logs of both the primary and the subscriber
      (filtering out some sensitive business content of course). It is
      striking that the subscriber node immediately after restart,
      starts asking for WALs, so this is not network firewall related,
      at least not in a static fashion. I mean if a firewall was in
      between with some rules then it could have (catastrophic) effect
      for the whole course of replication, not just those specific
      hours. So I too think that there maybe a bug in PostgreSQL (but
      the logical worker shares code with the wal receiver, this has
      been tested for so long) or the OS (again unlikely) *or* something
      weird may have happened, like some VM migration, e.g. cloning, on
      the cloud provider side. I don't have direct contact with them,
      our admin guys say nothing abnormal happened, the cloud provider
      says nothing abnormal happened, however my boss told me that they
      were doing some migrations from one site to another. That might
      explain some things.
    I will ask again. Thank you a lot!