Greetings,

* Alex Kliukin (alexk@xxxxxxxxxxxx) wrote:
> The cloning itself is done by copying a compressed image via ssh,
> running the following command from the replica:
>
> """ssh {master} 'cd {master_datadir} && tar -lcp --exclude "*.conf" \
>  --exclude "recovery.done" \
>  --exclude "pacemaker_instanz" \
>  --exclude "dont_start" \
>  --exclude "pg_log" \
>  --exclude "pg_xlog" \
>  --exclude "postmaster.pid" \
>  --exclude "recovery.done" \
>  * | pigz -1 -p 4' | pigz -d -p 4 | tar -xpmUv -C
> {slave_datadir}""
>
> The WAL archiving starts before the copy starts, as the script that
> clones the replica checks that the WALs archiving is running before
> the cloning.

Maybe you're doing it and haven't mentioned it, but you have to use
pg_start/stop_backup because otherwise PG is going to think it's doing
crash recovery from the last checkpoint written, rather than having to go
back to when the backup started and replay all of the WAL from that point.

Basically, this process is entirely broken unless you're actually taking a
filesystem-level atomic snapshot first (and that has to be atomic across
all tablespaces too). Perhaps that's what you meant when you mentioned a
snapshot, but if not, then this definitely isn't good.

Note that if you use pg_start/stop_backup, you need to make sure to wait
for the replica to be all the way caught up with where the
'pg_start_backup' was issued on the primary before you start copying files
on the replica.

> We have cloned hundreds of replicas with that procedure and never saw
> any issues, also never saw the "replication checkpoint has wrong magic"
> error, so we are wondering what could be the possible reason behind
> that failure? We also saw the disk error on another shard not long
> after the initial copy (but not on those that had the "replication
> checkpoint error"), so hardware issues are on our list as well (but
> then how comes both had the same wrong value for the "wrong magic"?)

If you've not seen any other corruption due to this, I'd call you
extremely lucky. I'd strongly suggest you look at some of the existing
tools for doing backup/recovery of PG and use them to build out replicas.

Thanks!

Stephen
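
For illustration only, here is a rough sketch of the ordering described above, assuming the clone is taken from the primary with a tar/pigz pipeline like the one quoted: the file copy is bracketed by pg_start_backup() and pg_stop_backup(), so that recovery on the new replica starts from the checkpoint taken when the backup began. The host name, data directory paths, the 'clone' label, and the DRY_RUN switch are placeholders invented for this example, and the function names assume a pre-15 PostgreSQL release (consistent with the pg_xlog directory in the quoted command). By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only. MASTER, MASTER_DATADIR, REPLICA_DATADIR, the 'clone' label
# and DRY_RUN are hypothetical placeholders, not values from the thread.
MASTER="${MASTER:-master.example.com}"
MASTER_DATADIR="${MASTER_DATADIR:-/var/lib/postgresql/data}"
REPLICA_DATADIR="${REPLICA_DATADIR:-/var/lib/postgresql/data}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it via sh.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$1"
    else
        sh -c "$1"
    fi
}

clone_replica() {
    # 1. Mark the start of the base backup on the primary (forces a checkpoint).
    start_cmd="psql -h $MASTER -Atc \"SELECT pg_start_backup('clone', true)\""
    # 2. Copy the data directory while the backup is in progress.
    copy_cmd="ssh $MASTER 'cd $MASTER_DATADIR && tar -cp --exclude pg_xlog --exclude postmaster.pid . | pigz -1 -p 4' | pigz -d -p 4 | tar -xpm -C $REPLICA_DATADIR"
    # 3. Mark the end of the backup so the needed WAL range is recorded.
    stop_cmd="psql -h $MASTER -Atc \"SELECT pg_stop_backup()\""

    run "$start_cmd" || return 1
    run "$copy_cmd"
    copy_status=$?
    # Always try to end backup mode, even if the copy failed.
    run "$stop_cmd" || return 1
    return "$copy_status"
}

clone_replica
```

Setting DRY_RUN=0 would execute the commands instead of printing them. The new replica still has to replay the archived WAL generated between the two calls before it is consistent, which is the bookkeeping the existing backup tools take care of.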