On Sat, Aug 8, 2020 at 2:24 PM Don Seiler <don@xxxxxxxxx> wrote:
Doing some more testing, and it seems to boil down to the fact that a replica should create a .partial file when it is promoted. I tested both PG 9.6 and 10, with replicas streaming vs. restoring WAL files. The sequence was always this:
- Stop primary (A)
- Observe the primary write out its final WAL file and confirm it is archived
- Promote replica (B)
- Add a recovery.conf to A for WAL recovery only and start it
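For step 4, the recovery.conf I drop onto A is just archive recovery (no primary_conninfo), roughly the sketch below; the restore_command shown is the generic wal-g form from its docs, and the exact settings in our environment may differ:

    # recovery.conf on the old primary (A): archive recovery only, no streaming
    standby_mode = 'on'
    restore_command = 'wal-g wal-fetch "%f" "%p"'
    # follow the timeline switch created by B's promotion
    recovery_target_timeline = 'latest'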
In cases where replica B was a streaming replica, upon promotion it would create a .partial file in the old timeline and then a history file in the new timeline and proceed. These files were then archived by wal-g, and the new replica (A) would recover them and continue replication.

If replica B was not streaming to begin with (so it was just restoring and recovering WAL from the archive), then upon promotion it would NOT create a .partial file. It would just create the .history file and open up. When A is then started in step 4, it complains like this (in PG 9.6):

2020-08-08 19:13:43.337 UTC [17837] LOG: database system was shut down at 2020-08-08 19:09:24 UTC
2020-08-08 19:13:44.580 UTC [17837] LOG: restored log file "00000002.history" from archive
ERROR: 2020/08/08 19:13:45.213175 Archive '00000003.history' does not exist.
2020-08-08 19:13:45.214 UTC [17837] LOG: entering standby mode
2020-08-08 19:13:45.676 UTC [17837] LOG: restored log file "00000002.history" from archive
2020-08-08 19:13:46.327 UTC [17837] LOG: restored log file "000000020000000000000006" from archive
2020-08-08 19:13:46.381 UTC [17837] LOG: invalid xl_info in primary checkpoint record
2020-08-08 19:13:46.381 UTC [17837] PANIC: could not locate a valid checkpoint record
2020-08-08 19:13:48.095 UTC [17836] LOG: startup process (PID 17837) was terminated by signal 6: Aborted
2020-08-08 19:13:48.095 UTC [17836] LOG: aborting startup due to startup process failure
2020-08-08 19:13:48.105 UTC [17836] LOG: database system is shut down

I even tested forcing an additional log switch from the new primary (which is what the example above shows), but it still fails the same way. I'm not sure if this is a bug or expected behavior with non-streaming replicas, but I believe it means that all of my old primaries need to be restored from new backups, since the new primaries were all set up via restore/recovery rather than streaming (due to VPN bandwidth limitations that made streaming difficult). Unless someone has a way forward?
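To make the difference concrete, this is roughly what I see land in the wal-g archive after promotion in each case (the segment and timeline names below are made up for illustration):

    streaming replica promoted:
        000000010000000000000006.partial    <- last, partial segment of the old timeline
        00000002.history
    archive-recovery-only replica promoted:
        00000002.history                    <- no .partial for the old timeline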
I'm following up to see if anyone else would categorize this as a bug. It renders my old primaries dead until I do a restore on them.
Don.
Don Seiler
www.seiler.us