Re: Unable to start replica after failover

Kyotaro Horiguchi <horikyota.ntt@xxxxxxxxx> · Tue, 23 Aug 2022 15:07:53 +0900 (JST)

> What additional information is needed?

Usually server logs and the output of pg_rewind at the trouble time
are needed as the first step.

> Next, pg_rewind returns errors while reading the log from the backup
> back, looking for the last checkpoint, which is quite reasonable
> because, once a new leader starts, the point of divergence normally
> ends up in the next timeline and the previous timeline's backup log
> does not have a block with the LSN of the divergence.

That sounds like pg_rewind is a crap.

pg_rewind reads timeline history files from the both servers to find
the last timeline up to where the two servers share the same history
then determine the divergence point at the latest LSN where the two
servers are known to share. Then it overwrites the pages modified
since the common checkpoint until the last (shutdown) checkpoint on
the previous leader that are modified in the *previous* timeline on the
former leader by the data of the same pages *on the new leader*. No
need for page data from the older timeline. If nothing's going wrong,
pg_rewind is not expected to face the situation of:

> could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50
> could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB, 

Which means the WAL files are somehow broken.

> When pg_rewind is run, it also uses the log from the backup (the
> lagging log from the new leader) instead of the partial log with
> which the former leader has already been started.

I don't see how come the former leader doesn't have access to the
partial log (or the latest WAL file, I suppose)? It is essential for
pg_rewind to work (since it exists nowhere other than there) and it
must be in pg_wal directory unless someone removed it.

Thus, I think we need the exact steps you and your system took after
the failover happened about postgresql.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center