Re: walreceiver termination

Justin King <kingpin867@xxxxxxxxx> · Thu, 23 Apr 2020 14:51:22 -0500

I assume it would be related to the following:

LOG:  incorrect resource manager data checksum in record at 2D6/C259AB90

since the walreceiver terminates just after this, but I'm unclear on
what precisely it means.  Without digging into the code, I would
guess that the replica is unable to verify the checksum of a WAL
record it just received from the master.  There are multiple replicas
here and only one is affected at a time, which would point to an
issue on that particular replica.  On the other hand, it happens
everywhere -- we have ~16 replicas across 3 different clusters (on
different versions) and we see this uniformly across all of them at
seemingly random times.
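
For anyone who wants to look at the record named in that LOG line,
here is a minimal sketch of mapping the LSN to a WAL segment file name
and a byte offset within it.  It assumes the default 16MB WAL segment
size and timeline 1; neither is stated in this thread, so adjust both
as needed.  The helper name lsn_to_segment is made up for illustration.

```python
def lsn_to_segment(lsn, timeline=1, seg_size=16 * 1024 * 1024):
    """Map an LSN such as '2D6/C259AB90' to (segment file name, offset).

    timeline and seg_size are assumptions: timeline 1 and the default
    16MB segment size. Check the actual values on the cluster.
    """
    hi_str, lo_str = lsn.split("/")
    hi, lo = int(hi_str, 16), int(lo_str, 16)

    # Absolute byte position in the WAL stream, then the segment it falls in.
    position = (hi << 32) | lo
    segno = position // seg_size

    # Segment file names are three 8-digit hex fields:
    # timeline, high part of the segment number, low part.
    segs_per_xlogid = 0x100000000 // seg_size
    name = "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )

    # Byte offset of the record inside that segment file.
    offset = position % seg_size
    return name, offset


name, offset = lsn_to_segment("2D6/C259AB90")
print(name, hex(offset))
```

With those assumptions, the segment file can be dumped on both the
master and the affected replica (pg_xlogdump on 9.x, pg_waldump on 10
and later) to compare what each side has at that position.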

On Thu, Apr 23, 2020 at 2:46 PM Justin King <kingpin867@xxxxxxxxx> wrote:
>
> On Thu, Apr 23, 2020 at 12:47 PM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> >
> > Justin King <kingpin867@xxxxxxxxx> writes:
> > > We've seen unexpected termination of the WAL receiver process.  This
> > > stops streaming replication, but the replica stays available --
> > > restarting the server resumes streaming replication where it left off.
> > > We've seen this across nearly every recent version of PG (9.4, 9.5,
> > > 11.x, 12.x) -- anything omitted is one we haven't used.
> >
> > > I don't have an explanation for the cause, but I was able to set
> > > logging to "debug5" and run an strace of the walreceiver PID when it
> > > eventually happened.  It appears as if the SIGTERM is coming from the
> > > "postgres: startup" process.
> >
> > The startup process intentionally SIGTERMs the walreceiver under
> > various circumstances, so I'm not sure that there's any surprise
> > here.  Have you checked the postmaster log?
> >
> >                         regards, tom lane
>
> Yep, I included "debug5" output of the postmaster log in the initial post.