Re: terminating walsender process due to replication timeout

Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> · Fri, 24 May 2019 09:23:50 +0300



    On 23/5/19 5:05 PM, AYahorau@xxxxxxxxxxx wrote:

    
      Hello Everyone!

      
      I can simplify and describe the issue I faced.

      I have 2 nodes in the db cluster: master and standby.
      

      I create a simple table on the master node with a command via psql:
      

      CREATE TABLE table1 (a INTEGER);
      

      After this I fill the table with the COPY command from a file which
      contains 2000000 (2 million) entries.
      

      And when I run, for example, a command such as:

      UPDATE table1 SET a='1'

      or a command such as:
      

      DELETE FROM table1;
      

      I see the following entry in the PostgreSQL log:

        terminating walsender process due to replication timeout

      I suppose that this issue is caused by the small value of
      wal_sender_timeout=1s and the long runtime of the queries (which take
      about 15 seconds).
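
The steps described above can be sketched as a single psql session. This is only a rough reproduction attempt: the COPY input file is not shown in the thread, so generate_series is used here as a stand-in (an assumption), and whether the timeout actually fires depends on the hardware and on the standby.

```sql
-- Assumption: generate_series replaces the COPY input file,
-- which is not included in the thread.
CREATE TABLE table1 (a INTEGER);
INSERT INTO table1 SELECT g FROM generate_series(1, 2000000) AS g;

-- Confirm the timeout currently in effect on the master.
SHOW wal_sender_timeout;

-- Either statement touches all 2 million rows and generates a burst of WAL.
UPDATE table1 SET a = 1;
DELETE FROM table1;
```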
      

      What is the best way to proceed? How can I avoid this? Is there any
      additional configuration which can help here?

    I have set mine to 15min. No problems for over 7 months, knock on wood.
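
For reference, one way to apply such a value without editing postgresql.conf by hand is sketched below. It assumes superuser access on the master and PostgreSQL 9.4 or later (for ALTER SYSTEM); the 15min value is simply the one mentioned above, not a general recommendation.

```sql
-- Run on the master as a superuser.
ALTER SYSTEM SET wal_sender_timeout = '15min';

-- The setting is picked up on a configuration reload; no restart is needed.
SELECT pg_reload_conf();

-- Verify the value (it may take a moment for the session to see the reload).
SHOW wal_sender_timeout;
```

The same line, wal_sender_timeout = 15min, can instead be placed in postgresql.conf and followed by a reload.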

    
      Regards, 
      

      Andrei 
      

      From:    Andrei Yahorau/IBA
      To:      Kyotaro HORIGUCHI <horiguchi.kyotaro@xxxxxxxxxxxxx>
      Cc:      pgsql-general@xxxxxxxxxxxxxx, rene.romero.b@xxxxxxxxx
      Date:    17/05/2019 11:04
      Subject: Re: terminating walsender process due to replication timeout
      

      Hello.

        
        Thanks for the answer.

        
        Can frequent database operations cause a standby server to fall
        behind? Is there a way to avoid this situation?

        I checked that walsender works well in my test if I set
        wal_sender_timeout to at least 5 seconds.
      

      Best regards, 
      

      Andrei Yahorau
      

      From:    Kyotaro HORIGUCHI <horiguchi.kyotaro@xxxxxxxxxxxxx>
      To:      AYahorau@xxxxxxxxxxx
      Cc:      rene.romero.b@xxxxxxxxx, pgsql-general@xxxxxxxxxxxxxx
      Date:    16/05/2019 10:36
      Subject: Re: terminating walsender process due to replication timeout
      

      Hello.

          
          At Wed, 15 May 2019 10:04:12 +0300, AYahorau@xxxxxxxxxxx wrote in
          <OF99D0D839.6A5BCB70-ON432583FB.0025912E-432583FB.0026D664@xxxxxx>

          > Hello,
          > Thank You for the response.
          >
          > Yes, it is possible to monitor replication delay. But my questions
          > were not about monitoring network issues.
          >
          > I use exactly wal_sender_timeout=1s because it allows me to detect
          > replication problems quickly.

          
          Though I don't have an exact idea of your configuration, it seems
          to me that your standby is simply falling more than one second
          behind the master. If you regard that as a replication problem,
          the configuration can be said to be detecting the problem
          correctly.

          Since the keep-alive packet is sent in-band, it doesn't get to the
          standby before already-sent-but-not-processed packets.
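
A rough way to see how far the standby trails while the large UPDATE or DELETE runs is to query pg_stat_replication on the master. This is a sketch under assumptions: the thread does not state the PostgreSQL version, and the column names below (sent_lsn, write_lsn, replay_lsn and the *_lag columns) are those of PostgreSQL 10 and later.

```sql
-- Run on the master while the bulk statement is executing.
SELECT application_name,
       state,
       sent_lsn,
       write_lsn,
       replay_lsn,
       -- WAL already sent but not yet written by the standby
       pg_wal_lsn_diff(sent_lsn, write_lsn)  AS sent_not_written_bytes,
       -- WAL already sent but not yet replayed by the standby
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS sent_not_replayed_bytes,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
```

If the lag values regularly exceed the configured wal_sender_timeout, that would be consistent with the explanation above, though it is only an indicator and not proof of the cause.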

          
          > So, I need clarification on the following questions:
          > Is it possible to use exactly this configuration and be sure that
          > it will work properly?
          > What did I do wrong? Should I correct my configuration somehow?
          > Is this the same issue as mentioned here:
          > https://www.postgresql.org/message-id/e082a56a-fd95-a250-3bae-0fff93832510@xxxxxxxxxxxxxxx
          > ? If so, why do I face this problem again?

          
          It is not the same "problem". What was mentioned there is a fast
          network making the sender-side loop busy, which prevents keepalive
          packets from being sent.

          
          regards.

          -- 
          Kyotaro Horiguchi
          NTT Open Source Software Center

          
    -- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt