Greetings, * Ron (ronljohnsonjr@xxxxxxxxx) wrote: > On 9/18/19 8:58 PM, David Steele wrote: > >On 9/18/19 9:40 PM, Ron wrote: > >>I'm concerned with one pgbackrest process stepping over another one and > >>the restore (or the "pg_ctl start" recovery phase) accidentally > >>corrupting the production database by writing WAL files to the original > >>cluster. > >This is not an issue unless you seriously game the system. When a > >cluster is promoted it selects a new timeline and all WAL will be > >archived to the repo on that new timeline. It's possible to promote a > >cluster without a timeline switch by tricking it but this is obviously a > >bad idea. > > What's a timeline switchover? Put simply, it's a branch off of the current WAL stream on to a new WAL stream and it happens whenever there's a promotion. Forgive the ASCII art, but- ----A---> timeline 1, where things start \-----> a promotion happened at time A, new WAL is on timeline 2 Consider an async replication scenario, where the network on the primary is lost but it keeps writing out WAL and accepting new commits, but at a time "A" we give up on it and promote the replica, so the replica switches to timeline 2 and starts accepting writes. Now we are in a situation where two systems are generating WAL (the network partitioned old primary, and the replica-now-primary). Having the promotion switch to a timeline makes it clear where that promotion happened and where the replica-now-primary's WAL stream started. This is actually what pg_rewind is based around too- to re-master the old primary, it'll find that split point A and "rewind" (well, not really, because it just grabs the pages, but whatever) the old primary back to A and then the old primary can follow the new primary on timeline 2. > >So, if you promote the new cluster and forget to disable archive_command > >there will be no conflict because the clusters will be generating WAL on > >separate timelines. > > No cluster promotion even contemplated. Ah, but you are talking about a cluster promotion, though you don't realize it. Any time there is a "at some point, I was to stop replaying WAL and start accepting new changes", there's a timeline switch and notionally a promotion. > The point of the exercise would be to create an older copy of the cluster -- > while the production cluster is still running, while production jobs are > still pumping data into the production database -- from before the time of > the data loss, and query it in an attempt to recover the records which were > deleted. Sure, that's all entirely possible and shouldn't be an issue. When you go through the restore process and specify a point where you want the restore to stop, so that you can connect and pull the down the table, when PG reaches that point it'll promote and do a timeline switch. Now, if you don't actually want that restore to promote and come up as a system that you can write to, you could instead say 'pause', and then connect to the database and grab whatever data you needed. That should also avoid the concern around archive command, provided you never actually let that system finish recovery and instead just shut it down while it's still read-only. If you want to play around with this stuff and see what happens with a promote, or try doing a pause instead, you might be interested in: https://learn.crunchydata.com/ and specifically the pgbackrest one: https://learn.crunchydata.com/pg-administration/courses/basic-postgresql-for-dbas/pgbackrest/ Basically, it's kind of like a blog post where you can play around on a scratch system that's built into the page and click through the steps to see what happens, and change things around if you want. Thanks, Stephen
Attachment:
signature.asc
Description: PGP signature