Failover Testing Failures: invalid resource manager ID in primary checkpoint record

Don Seiler <don@xxxxxxxxx> · Wed, 18 Jan 2023 17:47:37 -0600

PostgreSQL 12.13 (PGDG packages) in a streaming replication configuration. pgBackRest 2.43 is used for WAL archiving and DB backups to cloud storage.
I'm testing and documenting a DR exercise process where I:
1. Cleanly shut down PG on the primary.
2. Promote the PG DR replica.
3. Place the standby.signal file on the old primary and start it up (this presumes no other configuration needs changing; primary_conninfo etc. were already set).
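
For reference, the steps above can be sketched roughly as follows. This is only an illustration: it assumes the servers are controlled directly with pg_ctl (PGDG packages are often driven through systemd instead), the function names are made up for this example, and the data directory paths are whatever the caller passes in. It builds the commands without running them, and includes a pg_controldata call so the old primary's "Database cluster state" can be inspected after the shutdown.

```python
from pathlib import Path


def dr_switchover_plan(old_primary_pgdata, replica_pgdata):
    """Return the switchover steps as (host role, argv) pairs, in order.

    Nothing is executed here; each command must be run on the host named
    in the first element. Helper name and layout are hypothetical.
    """
    return [
        # Step 1: clean ("fast") shutdown of the current primary.
        ("old primary", ["pg_ctl", "stop", "-D", old_primary_pgdata, "-m", "fast"]),
        # Optional check: look for "Database cluster state: shut down".
        ("old primary", ["pg_controldata", "-D", old_primary_pgdata]),
        # Step 2: promote the DR replica.
        ("replica", ["pg_ctl", "promote", "-D", replica_pgdata]),
        # Step 3b: start the old primary (after standby.signal is in place).
        ("old primary", ["pg_ctl", "start", "-D", old_primary_pgdata]),
    ]


def place_standby_signal(pgdata):
    """Step 3a: create an empty standby.signal file in the data directory."""
    signal_file = Path(pgdata) / "standby.signal"
    signal_file.touch()
    return signal_file
```

The plan is returned as data rather than executed so that it can be reviewed, or pasted into the DR documentation, before anything is run against a real cluster.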
My hope is that I could just start the old primary as the new replica if it was cleanly shut down prior to promoting the replica. However, when I try to start up that new replica, I'm met with:

LOG:  restored log file "00000002000000B70000005A" from archive
LOG:  invalid resource manager ID in primary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 17660) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
LOG:  database system is shut down

It doesn't appear that any WAL files are missing, as it finds all the files that it asks for. Am I missing a piece here?
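
As an aid to reading the log above, here is a small sketch that decodes a WAL segment file name. A segment name is 24 hex digits: 8 for the timeline ID, 8 for the high 32 bits of the LSN, and 8 for the segment number within that range. The function names are made up for this example, and the start-LSN calculation assumes the default 16 MB WAL segment size.

```python
import re

# WAL segment file names are exactly 24 uppercase hex digits.
WAL_NAME_RE = re.compile(r"^[0-9A-F]{24}$")


def parse_wal_segment_name(name):
    """Split a WAL segment file name into (timeline, log, segment) integers."""
    if not WAL_NAME_RE.match(name):
        raise ValueError(f"not a WAL segment file name: {name!r}")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, log, segment


def segment_start_lsn(log, segment, wal_segment_size=16 * 1024 * 1024):
    """Return the LSN of the first byte in the segment, formatted as X/X.

    Assumes the default 16 MB segment size unless told otherwise.
    """
    offset = segment * wal_segment_size
    return f"{log:X}/{offset:X}"
```

For the file in the log, `parse_wal_segment_name("00000002000000B70000005A")` gives timeline 2, and the segment starts at LSN `B7/5A000000`. So the file being restored from the archive is a timeline 2 segment, which would be the promoted replica's timeline if the old primary had been running on timeline 1.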

My hope is to avoid having to do a restore to rebuild the new replica.

An aside for those who may be asking: most of these databases do not have data checksums enabled, so pg_rewind isn't in the picture, although I'm reading now that we could enable the wal_log_hints parameter as an alternative. I'm leery of the overhead, but if it's the same overhead that would be incurred with data checksums, then I guess there would be nothing lost when we eventually enable them.
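
To check which clusters already meet that pg_rewind prerequisite, something like the following sketch could be run over saved pg_controldata output. It assumes the output contains the "wal_log_hints setting" and "Data page checksum version" lines and that a checksum version of 0 means checksums are disabled. The function names are made up for this example, and it does not check full_page_writes, which pg_rewind also needs.

```python
def parse_controldata(text):
    """Parse 'key: value' lines of pg_controldata output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, since values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def rewind_prerequisite_met(fields):
    """True if wal_log_hints is on or data checksums are enabled.

    A missing checksum line is treated as "disabled" (an assumption made
    here so the check errs on the side of caution).
    """
    hints_on = fields.get("wal_log_hints setting") == "on"
    checksums_on = fields.get("Data page checksum version", "0") != "0"
    return hints_on or checksums_on
```

The text to parse would come from running `pg_controldata` against each data directory.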

-- 
Don Seiler
www.seiler.us