On 9/18/19 8:58 PM, David Steele wrote:
On 9/18/19 9:40 PM, Ron wrote:
I'm concerned with one pgbackrest process stepping over another one and
the restore (or the "pg_ctl start" recovery phase) accidentally
corrupting the production database by writing WAL files to the original
cluster.
This is not an issue unless you seriously game the system. When a
cluster is promoted it selects a new timeline and all WAL will be
archived to the repo on that new timeline. It's possible to promote a
cluster without a timeline switch by tricking it but this is obviously a
bad idea.
What's a timeline switchover?
So, if you promote the new cluster and forget to disable archive_command
there will be no conflict because the clusters will be generating WAL on
separate timelines.
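
To make the "separate timelines" point concrete: PostgreSQL encodes the timeline ID in the first 8 hex digits of each 24-character WAL segment file name, so segments written at the same WAL position on different timelines get different names. Here is a minimal sketch of that. The helper name wal_timeline and both segment names are made up for illustration; they are not taken from any real cluster.

```python
import string


def wal_timeline(segment_name: str) -> int:
    """Return the timeline ID encoded in a WAL segment file name.

    A segment name is 24 hex digits: 8 for the timeline, then 16 that
    identify the position of the segment in the WAL stream.
    """
    if len(segment_name) != 24 or any(
        c not in string.hexdigits for c in segment_name
    ):
        raise ValueError(f"not a WAL segment name: {segment_name!r}")
    return int(segment_name[:8], 16)


# Hypothetical names: same WAL position, different timelines.
production = "000000010000000A0000002F"
promoted_copy = "000000020000000A0000002F"

print(wal_timeline(production), wal_timeline(promoted_copy))
print(production == promoted_copy)
```

Because the names differ, the two segments can sit side by side in the same archive without one overwriting the other.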
No cluster promotion even contemplated.
The point of the exercise would be to create an older copy of the cluster --
while the production cluster is still running, while production jobs are
still pumping data into the production database -- from before the time of
the data loss, and query it in an attempt to recover the records which were
deleted.
In the case of a future failover a higher timeline will be selected so
there still won't be a conflict.
Unfortunately, that dead WAL from the rogue cluster will persist in the
repo until a PostgreSQL upgrade because expire doesn't know when it can
be removed since it has no context. We're not quite sure how to handle
this but it seems a relatively minor issue, at least as far as
consistency is concerned.
If you do have a split-brain situation where two primaries are archiving
on the same timeline then first-in wins. WAL from the losing primary
will be rejected.
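
The "first-in wins" rule can be pictured with a toy model. This is only a sketch of the idea under the assumption that the archive is keyed by segment name; ToyArchive and its push method are hypothetical and are not pgBackRest's actual code or API.

```python
class ToyArchive:
    """Toy stand-in for a WAL repository; not pgBackRest's implementation."""

    def __init__(self) -> None:
        # Maps segment file name -> segment contents.
        self._segments = {}

    def push(self, name: str, data: bytes) -> bool:
        """Store a segment. Return False if that name is already archived."""
        if name in self._segments:
            return False  # an earlier writer got here first; reject
        self._segments[name] = data
        return True

    def get(self, name: str) -> bytes:
        return self._segments[name]


archive = ToyArchive()
segment = "000000010000000A0000002F"  # hypothetical segment name

first = archive.push(segment, b"WAL from primary A")
second = archive.push(segment, b"WAL from primary B")
print(first, second)
```

In this model the first primary to archive a given segment name keeps it, and the later push for the same name is refused rather than overwriting what is already stored.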
Regards,
--
Angular momentum makes the world go 'round.