Unable to start replica after failover

"Lahnov, Igor" <Igor.Lahnov@xxxxxxxxxx> · Fri, 29 Jul 2022 15:01:44 +0000

Hi,

We're using asynchronous streaming replication together with WAL archiving and recovery
 (we've set these parameters: archive_command, restore_command, recovery_target_timeline=latest).
To manage backup and recovery, we're using pg_probackup.
We have a 3-node cluster managed by Patroni on Raft.
While running our cluster, we encounter a recurring problem: after a failover, the former
 leader cannot recover neither via the tools provided by PostgreSQL nor with the Patroni tools (pg_rewind).
We've seen a number of different WAL reading errors.
For example:

·        
"could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50"; or

·        
"could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB,
 offset 3448832".
There have been some other variations but all such errors have the same cause:

1.      
At startup, the former leader does not immediately get access to the backup. As a result, the recovery process uses the
 local WAL, which inevitably leads to a split brain since in most cases there is a lag in replication after a failover.

2.      
Next, the former leader finally manages to read the latest log from the backup, and that log has a higher priority than
 the local log, even though the recovery with the local log has already occurred.
As a result, we see the following:
"new timeline 3 forked off current database system timeline 2 before current recovery
 point 54E/FB348150".
When pg_rewind is run, it also uses the log from the backup (the lagging log from the
 new leader) instead of the partial log with which the former leader has already been started.
Next, pg_rewind returns errors while reading the log from the backup back, looking for
 the last checkpoint, which is quite reasonable because, once a new leader starts, the point of divergence normally ends up in the next timeline and the previous timeline's backup log does not have a block with the LSN of the divergence.
Also, in this case, it is impossible to build a map of the pages that have been changed
 to roll back the changes using pg_rewind.

To avoid the problem, we decided to stop using restore_command.
 Could you please let us know if there is a better solution to the problem we've described?