Re: Replication failure, slave requesting old segments

"Phil Endecott" <spam_from_pgsql_lists@xxxxxxxxxxxx> · Sun, 12 Aug 2018 22:56:11 +0100

Stephen Frost wrote:
>* Phil Endecott (spam_from_pgsql_lists@xxxxxxxxxxxx) wrote:
>>2018-08-11 00:12:15.536 UTC [7954] LOG:  restored log file "0000000100000007000000D0" from archive
>>2018-08-11 00:12:15.797 UTC [7954] LOG:  redo starts at 7/D0F956C0
>>2018-08-11 00:12:16.068 UTC [7954] LOG:  consistent recovery state reached at 7/D0FFF088

>>Are the last two log lines above telling us anything useful?  Is that
>>saying that, of the 16 MB (0x1000000 byte) WAL file, it restored only as
>>far as byte 0xF956C0 or 0xFFF088?  Is that what we would expect?  Is it
>>trying to use the streaming connection to get the missing bytes from
>>0xFFF088 to 0xFFFFFF?  Or is that just an empty gap at the end of the
>>file due to the next record being too big to fit?
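
The arithmetic behind those questions can be checked with a small sketch. It assumes the default 16 MB segment size and timeline 1 (both consistent with the file name in the log above); the helper names are made up for illustration and are not part of PostgreSQL.

```python
# Sketch: map an LSN such as "7/D0F956C0" to its WAL file name and byte offset.
# Assumes 16 MB segments (the default); helper names are hypothetical.
WAL_SEGMENT_SIZE = 0x1000000


def parse_lsn(text):
    """Turn 'hi/lo' hex notation into a single 64-bit integer position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(lsn, timeline=1, seg_size=WAL_SEGMENT_SIZE):
    """Build the 24-hex-digit segment file name that contains this LSN."""
    segs_per_xlogid = 0x100000000 // seg_size  # 256 for 16 MB segments
    segno = lsn // seg_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


def segment_offset(lsn, seg_size=WAL_SEGMENT_SIZE):
    """Byte offset of the LSN inside its segment file."""
    return lsn % seg_size


for text in ("7/D0F956C0", "7/D0FFF088"):
    lsn = parse_lsn(text)
    offset = segment_offset(lsn)
    print(text, wal_file_name(lsn), hex(offset), WAL_SEGMENT_SIZE - offset)
```

Both LSNs fall inside 0000000100000007000000D0, and the second one leaves 0xF78 (3960) bytes before the end of the segment, which is the gap being asked about.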

>The short answer is that, yes, the next record was likely too big to
>fit, but that's what the replica was trying to figure out and couldn't
>because D0 was gone from the primary already.  One thing which should
>help this would be to use physical replication slots on the primary,
>which would keep it from throwing away WAL files until it knows the
>replica has them, but that runs the risk of ending up with lots of extra
>WAL on the primary if the replica is gone for a while.  If you'd prefer
>to avoid that then having wal_keep_segments set to '1' would avoid this
>particular issue as well, I'd expect.
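
For reference, the two options described above might look roughly like the fragment below. This is only a sketch: it assumes a release that still uses recovery.conf on the standby and still has wal_keep_segments, and the slot name 'standby1' is hypothetical. The slot would have to be created on the primary first, for example with pg_create_physical_replication_slot('standby1').

```
# Option 1: postgresql.conf on the primary, keep one extra segment around
wal_keep_segments = 1

# Option 2: recovery.conf on the standby, stream via a physical slot
# ('standby1' is a hypothetical slot name, created beforehand on the primary)
primary_slot_name = 'standby1'
```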

>I do wonder if perhaps we should just default to having it as '1' to
>avoid exactly this case, as it seems like perhaps PG archived D0 and
>then flipped to D1 and got rid of D0, which is all pretty reasonable,
>except that a replica trying to catch up is going to end up asking for
>D0 from the primary because it didn't know if there was anything else
>that should have been in D0.

OK.  I think this is perhaps a documentation bug, maybe a missing
warning when the master reads its configuration, and maybe (as you say)
a bad default value.

Specifically, section 26.2.5 of the docs says:

"If you use streaming replication without file-based continuous archiving,
the server might recycle old WAL segments before the standby has received
them. If this occurs, the standby will need to be reinitialized from a new
base backup. You can avoid this by setting wal_keep_segments to a value
large enough to ensure that WAL segments are not recycled too early, or by
configuring a replication slot for the standby. If you set up a WAL archive
that's accessible from the standby, these solutions are not required, since
the standby can always use the archive to catch up provided it retains enough
segments."

OR, maybe the WAL reader that processes the files that restore_command fetches
could be smart enough to realise that it can skip over the gap at the end?

Anyway.  Do others agree that my issue was the result of
wal_keep_segments=0?

Regards, Phil.