Re: Streaming replication connection break - unexpected EOF on standby connection

Fabio Pardi <f.pardi@xxxxxxxxxxxx> · Tue, 17 Jul 2018 10:14:32 +0200

Yeah, well done!

Thanks for letting us know and glad my tips helped!

regards,

fabio pardi

On 07/17/2018 10:10 AM, Ganesh Korde wrote:
> Hi,
>      
>    Finally issue has been resolved and issue was with network. There
> were two issues as below.
> 
> 1. We have two links between primary and secondary, VPN tunnel and 
> L2TP.  Multiple routes were configured for VPN and  L2TP from the
> Secondary to                 primary. Tunnel and link was always up. 
> But, ss the source from Secondary is coming through tunnel  towards
> Primary the connection was getting dropped         after  reaching the
> destination due to ambiguity on routes between L2TP and VPN tunnel. This
> issue has been resolved by allowing access via VPN tunnel         
>  removing  L2TP route.
> 
> 2. After fixing the above, still disconnection was happening, but after
> specific time interval. It was due to certain negotiation time set IPSEC
> tunnel.
> So they have enabled auto negotiation on the IPSEC tunnel so it won’t
> wait from tunnel initiation from other end.
> 
> So issues related disconnection has been resolved.
> 
> Thanks Johannes and Fabio for your help.
> 
> Regards,
> Ganesh.
> 
> On Tue, Jul 3, 2018 at 5:37 PM Fabio Pardi <f.pardi@xxxxxxxxxxxx
> <mailto:f.pardi@xxxxxxxxxxxx>> wrote:
> 
>     Hi Ganesh,
> 
>     the logs you posted refer to timeouts in the connections.
> 
>     Your configuration tells that in case of network drop, the standby
>     server will be the first to acknowledge it. That's because
>     wal_receiver_timeout is < than wal_sender_timeout
> 
>     From the documentation:
> 
>     |wal_receiver_timeout| (|integer|)
> 
>         Terminate replication connections that are inactive longer than
>         the specified number of milliseconds. This is useful for the
>         receiving standby server to detect a primary node crash or
>         network outage. A value of zero disables the timeout mechanism.
>         This parameter can only be set in the |postgresql.conf| file or
>         on the server command line. The default value is 60 seconds.
> 
> 
>     That might explain why your secondary server calls for a RST.
> 
>     RST packages you posted are more a consequence, than a cause of your
>     problem.
> 
>     I think that the RST is sent to acknowledge master that the
>     connection should be closed due to timeout.
> 
>     What above goes together with the fact that the sequence number
>     1232664740 of the RST packet is retransmitted several times, meaning
>     that it did not reach its destination at first.
> 
>     I would look more carefully to your network because I suspect the
>     real problem might be there.
> 
> 
>     regards,
> 
>     fabio pardi
> 
> 
> 
> 
>     On 03/07/18 09:36, Ganesh Korde wrote:
>>     Hi,
>>
>>         After analysis by network team, they found packets are getting
>>     reset by Secondary server. Below are the logs.
>>
>>     782.822280 port7 in <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822310 wan2 out <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822313 port7 in <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822315 wan2 out <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822317 port7 in <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822319 wan2 out <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>     782.822345 port7 in <Secondary_server_IP>.35918 ->
>>     <Primary_server_IP>.5433: rst 1232664740
>>
>>
>>     But they didn't able to find why secondary generating reset
>>     packet. There are no any devices between these servers which can
>>     modify the packets.
>>     Though both servers are on different firewall, but packets are
>>     getting reset at secondary server and not at the firewall level,
>>     we can see this in the log.
>>
>>     Below points I would like to mention about application 
>>     1. This connection interruption happens in day time, when
>>     transactions are little bit high. In day time, average
>>     transactions per second are 5 (Inserts and processing). 
>>     2. We are not using connection pool, so each time request comes
>>     app server creates new connection to db server and when processing
>>     is done, app server disconnects. 
>>
>>     We are now clue less why secondary server resetting the packets. 
>>     Any help is highly appreciated. 
>>
>>     Thanks & Regards,
>>     Ganesh.
>>
>>
>>
>>
>>     On Thu, Jun 28, 2018 at 3:38 PM Ganesh Korde
>>     <ganeshakorde@xxxxxxxxx <mailto:ganeshakorde@xxxxxxxxx>> wrote:
>>
>>         Hi  Johannes,
>>
>>           Thanks for your reply. We are using VPN Tunnel between these
>>         two hosts. I will check with network team, with remaining
>>         questions you mentioned and will get back.
>>
>>         Thanks  & Regards,
>>         Ganesh.
>>
>>         On Wed, Jun 27, 2018 at 6:46 PM Johannes Truschnigg
>>         <johannes@xxxxxxxxxxxxxxx <mailto:johannes@xxxxxxxxxxxxxxx>>
>>         wrote:
>>
>>             Hi Ganesh,
>>
>>
>>             On Wed, Jun 27, 2018 at 06:37:25PM +0530, Ganesh Korde wrote:
>>             > [...]
>>             > 1. Because of what reason, " unexpected EOF on standby
>>             connection" occurs
>>             > on primary db server?
>>             > 2. After replication disconnection, secondary should
>>             immediately connect to
>>             > primary, but it takes some time, what could be the
>>             reason for this?
>>
>>             From skimming the log, it seems to me that there is an
>>             issue at the
>>             socket/network level, which yields the "connection reset
>>             by peer" eror
>>             message.
>>
>>             What is the network between these two hosts like? Is it a
>>             WAN link; is a VPN
>>             or SSH tunnel involved? Do you have other, long-running
>>             TCP sessions between
>>             these peers, and do they experience similar or other
>>             problems? Do the hosts'
>>             link-layer stats hint at problems, e. g. packet loss? Do
>>             the hosts' kernels
>>             leave a message hinting at L2 connectivity problems in
>>             their debug ringbuffers
>>             (`dmesg`) at the time you observe the replication drop out?
>>
>>             -- 
>>             with best regards:
>>             - Johannes Truschnigg ( johannes@xxxxxxxxxxxxxxx
>>             <mailto:johannes@xxxxxxxxxxxxxxx> )
>>
>>             www:   https://johannes.truschnigg.info/
>>             phone: +43 650 2 133337
>>             xmpp:  johannes@xxxxxxxxxxxxxxx
>>             <mailto:johannes@xxxxxxxxxxxxxxx>
>>
>>             Please do not bother me with HTML-email or attachments.
>>             Thank you.
>>
>