Re: Streaming replication connection break - unexpected EOF on standby connection

Ganesh Korde <ganeshakorde@xxxxxxxxx> · Tue, 17 Jul 2018 13:40:18 +0530

Hi,     
   Finally issue has been resolved and issue was with network. There were two issues as below.

1. 

We have two links between primary and secondary, VPN tunnel and 

L2TP.  Multiple routes were configured for VPN and  

L2TP

from the Secondary to                 primary. Tunnel and link was always up.  But, ss the source from Secondary is coming through tunnel  towards Primary the connection was getting dropped         after  reaching the destination due to ambiguity on routes between L2TP and VPN tunnel. This issue has been resolved by allowing access via VPN tunnel           removing  L2TP route.

2. After fixing the above, still disconnection was happening, but after specific time interval. It was due to certain negotiation time set IPSEC tunnel.
So they have enabled auto negotiation on the IPSEC tunnel so it won’t wait from tunnel initiation from other end.

So issues related disconnection has been resolved.

Thanks Johannes and Fabio for your help.

Regards,
Ganesh.

On Tue, Jul 3, 2018 at 5:37 PM Fabio Pardi <f.pardi@xxxxxxxxxxxx> wrote:

    Hi Ganesh,
    the logs you posted refer to timeouts in the connections.
    Your configuration tells that in case of network drop, the
      standby server will be the first to acknowledge it. That's because
      wal_receiver_timeout is < than wal_sender_timeout

    From the documentation:

      wal_receiver_timeout (integer) 

        Terminate replication connections that are inactive longer
          than
          the specified number of milliseconds. This is useful for the
          receiving standby server to detect a primary node crash or
          network
          outage. A value of zero disables the timeout mechanism. This
          parameter can only be set in the postgresql.conf
          file or on the server command
          line. The default value is 60 seconds.

    That might explain why your secondary server calls for a RST. 

    RST packages you posted are more a consequence, than a cause of your
    problem.

    I think that the RST is sent to acknowledge master that the
    connection should be closed due to timeout.

    What above goes together with the fact that the sequence number 1232664740 of the RST packet is
        retransmitted several times, meaning that it did not reach its
        destination at first.

        I would look more carefully to your network because I suspect
        the real problem might be there. 

        regards,

        fabio pardi

    On 03/07/18 09:36, Ganesh Korde wrote:

      Hi,

            After analysis by network team, they found packets are
          getting reset by Secondary server. Below are the logs.

          782.822280 port7 in
                <Secondary_server_IP>.35918 ->
                <Primary_server_IP>.5433: rst 1232664740
          782.822310 wan2 out
                <Secondary_server_IP>.35918 ->
                <Primary_server_IP>.5433: rst 1232664740
          782.822313 port7 in
                <Secondary_server_IP>.35918 ->
                <Primary_server_IP>.5433: rst 1232664740
          782.822315 wan2 out
                <Secondary_server_IP>.35918 ->
                <Primary_server_IP>.5433: rst 1232664740
          782.822317 port7 in
                <Secondary_server_IP>.35918 ->
                <Primary_server_IP>.5433: rst 1232664740
          782.822319 wan2 out
                <Secondary_server_IP>.35918 ->
                <Primary_server_IP>.5433: rst 1232664740
          782.822345 port7 in
                <Secondary_server_IP>.35918 ->
                <Primary_server_IP>.5433: rst 1232664740

        But they didn't able to find why secondary generating reset
          packet. There are no any devices between these servers which
          can modify the packets.
        Though both servers are on different firewall, but packets
          are getting reset at secondary server and not at the firewall
          level, we can see this in the log.

        Below points I would like to mention about application 
        1. This connection interruption happens in day time, when
          transactions are little bit high. In day time, average
          transactions per second are 5 (Inserts and processing). 
        2. We are not using connection pool, so each time request
          comes app server creates new connection to db server and when
          processing is done, app server disconnects. 

        We are now clue less why secondary server resetting the
          packets.  Any help is highly appreciated. 

        Thanks & Regards,
        Ganesh.

        On Thu, Jun 28, 2018 at 3:38 PM Ganesh Korde <ganeshakorde@xxxxxxxxx>
          wrote:

            Hi  Johannes,

              Thanks for your reply. We are using VPN Tunnel
              between these two hosts. I will check with network team,
              with remaining questions you mentioned and will get back.

              Thanks 
              & Regards,
            Ganesh.

            On Wed, Jun 27, 2018 at 6:46 PM Johannes
              Truschnigg <johannes@xxxxxxxxxxxxxxx>
              wrote:

            Hi
              Ganesh,

              On Wed, Jun 27, 2018 at 06:37:25PM +0530, Ganesh Korde
              wrote:

              > [...]

              > 1. Because of what reason, " unexpected EOF on
              standby connection" occurs

              > on primary db server?

              > 2. After replication disconnection, secondary should
              immediately connect to

              > primary, but it takes some time, what could be the
              reason for this?

              From skimming the log, it seems to me that there is an
              issue at the

              socket/network level, which yields the "connection reset
              by peer" eror

              message.

              What is the network between these two hosts like? Is it a
              WAN link; is a VPN

              or SSH tunnel involved? Do you have other, long-running
              TCP sessions between

              these peers, and do they experience similar or other
              problems? Do the hosts'

              link-layer stats hint at problems, e. g. packet loss? Do
              the hosts' kernels

              leave a message hinting at L2 connectivity problems in
              their debug ringbuffers

              (`dmesg`) at the time you observe the replication drop
              out?

              -- 

              with best regards:

              - Johannes Truschnigg ( johannes@xxxxxxxxxxxxxxx )

              www:   https://johannes.truschnigg.info/

              phone: +43 650 2 133337

              xmpp:  johannes@xxxxxxxxxxxxxxx

              Please do not bother me with HTML-email or attachments.
              Thank you.