Hi Vladimir,

On Thu, Apr 2, 2015 at 2:07 PM, Vladimir Borodin <root@xxxxxxxxxxx> wrote:

Hi, Alexey.
The new replica did start and was restoring WAL files for a while, but eventually we came across the PANIC message:
2015-03-18 19:10:52.943 CET,,,17293,,55083494.438d,922,,2015-03-17 15:05:08 CET,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog redo visible: rel 1663/16414/24453; blk 26569",,,,""
We did check the disk on that system (and are now rechecking the memory), but so far the hardware itself looks OK, which makes me wonder whether the procedure above is flawed. What would be the proper way to produce a base backup from the standby without using pg_basebackup?
If you still want to use your own solution, you could look at how barman actually does it. It can take backups from replicas and uses the pgespresso [1] extension for that.
Thank you. pgespresso wraps the start/stop backup functionality designed for streaming replication into user-callable functions (with a timeline hack for the replica). While it's a good solution on its own, I'm wondering whether start/stop backup on the master, together with archiving WAL segments and copying data from the replica, should produce a valid base backup (and a valid replica built from it) as well.
Well, I have never tried to do so, but I think the reason the replica starts applying WALs from too late a location is that you do not copy the backup label file from the master after issuing pg_start_backup. Does your tool copy it from the master?
According to doc [0]:
It's also worth noting that the pg_start_backup function makes a file named backup_label in the database cluster directory, which is removed by pg_stop_backup. This file will of course be archived as a part of your backup dump file. The backup label file includes the label string you gave to pg_start_backup, as well as the time at which pg_start_backup was run, and the name of the starting WAL file. In case of confusion it is therefore possible to look inside a backup dump file and determine exactly which backup session the dump file came from. However, this file is not merely for your information; its presence and contents are critical to the proper operation of the system's recovery process.
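To make the role of backup_label concrete, here is a minimal Python sketch that reads the starting WAL location and file name out of a backup_label file. The sample contents are made up for illustration (they are not from the system discussed in this thread), and the function name parse_backup_label is hypothetical; the field layout follows what 9.x servers write, so treat it as an assumption for other versions.

```python
import re

# Hypothetical backup_label contents, in the layout written by pg_start_backup
# on a 9.x master. All values here are invented for illustration.
SAMPLE_LABEL = """\
START WAL LOCATION: 0/9000028 (file 000000010000000000000009)
CHECKPOINT LOCATION: 0/9000060
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2015-03-17 15:05:08 CET
LABEL: example_backup
"""


def parse_backup_label(text):
    """Return the fields recovery cares about from backup_label text."""
    fields = {}
    for line in text.splitlines():
        # Each line has the form "KEY: value"; split on the first ": " only.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value

    # The start location line carries both the LSN and the WAL segment name.
    match = re.match(
        r"([0-9A-F]+/[0-9A-F]+) \(file ([0-9A-F]{24})\)",
        fields.get("START WAL LOCATION", ""),
    )
    if match is None:
        raise ValueError("backup_label has no usable START WAL LOCATION line")

    return {
        "start_lsn": match.group(1),
        "start_wal_file": match.group(2),
        "backup_from": fields.get("BACKUP FROM"),
        "label": fields.get("LABEL"),
    }


if __name__ == "__main__":
    info = parse_backup_label(SAMPLE_LABEL)
    print(info["start_lsn"], info["start_wal_file"])
```

A backup tool could run a check like this on the copied data directory before starting the new replica, to confirm that a backup_label is present and that the WAL segment it names is available in the archive.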
Intuitively, it looks like a delay between the master and the replica might result in them having different 'states' (say, atomic snapshots of data/base files) of the database at the point P when the base backup is started (say, master at state B, replica at earlier state A), and since P is determined from the master, the changes to transform the replica from state A to state B might not be included in the sequence of WALs to replay on the new replica.
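The intuition above can be written down as a comparison of WAL locations. This is only a sketch of the suspected condition, not a verified diagnosis: it assumes the master's backup start location comes from backup_label and the replica's position comes from something like pg_last_xlog_replay_location() on a 9.x standby. The function names and the sample values are hypothetical.

```python
def lsn_to_int(lsn):
    """Convert a textual WAL location such as '0/9000028' to an integer.

    The text form is two hexadecimal numbers: the high 32 bits and the
    low 32 bits of a 64-bit position in the WAL stream.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replica_behind_backup_start(replica_replay_lsn, backup_start_lsn):
    """True if the replica has not yet replayed up to the backup start point.

    In the terms used above: the replica is still at state A while the
    master took its backup start point P at the later state B.
    """
    return lsn_to_int(replica_replay_lsn) < lsn_to_int(backup_start_lsn)


if __name__ == "__main__":
    # Invented values: the replica lags the master's backup start location.
    print(replica_behind_backup_start("0/8FFFF00", "0/9000028"))
```

If such a check returned true, one option under this hypothesis would be to wait until the replica has replayed past the start location before copying its data files.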
Alexey
--
May the force be with you…