Re: walreceiver termination

Kyotaro Horiguchi <horikyota.ntt@xxxxxxxxx> · Fri, 08 May 2020 09:56:03 +0900 (JST)

Hello.

At Mon, 4 May 2020 09:09:15 -0500, Justin King <kingpin867@xxxxxxxxx> wrote in 
> Would there be anyone that might be able to help troubleshoot this
> issue -- or at least give me something that would be helpful to look
> for?
> 
> https://www.postgresql.org/message-id/flat/CAGH8ccdWLLGC7qag5pDUFbh96LbyzN_toORh2eY32-2P1%3Dtifg%40mail.gmail.com
> https://www.postgresql.org/message-id/flat/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
> https://dba.stackexchange.com/questions/116569/postgresql-docker-incorrect-resource-manager-data-checksum-in-record-at-46f-6
> 
> I'm not the first one to report something similar and all the
> complaints have a different filesystem in common -- particularly ZFS
> (or btrfs, in the bottom case).  Is there anything more we can do here
> to help narrow down this issue?  I'm happy to help, but I honestly
> wouldn't even know where to begin.

The sendto() call at the end of your strace output is "close
connecion" request to wal sender and normally should be followed by
close() and kill().  If it is really the last strace output, the
sendto() is being blocked with buffer-full.

My diagnosis of the situation is that your replication connection had
a trouble and the TCP session is broken in the way wal receiver
couldn't be aware of the breakage.  As the result feedback message
packets from wal receiver were detained in tcp send buffer then
finally the last sendto() was blocked while sending the
close-connection message.

If it happens constantly, routers or firewalls between the primary and
standby may be discarding sessions inadvertantly.

I'm not sure how ZFS can be involved in this trouble, though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center