Greetings,

* PO (gunnar.bluth@xxxxxxxxxxx) wrote:
> Consider the following scenario/setup:
> - 4 DB servers in 2 DCs
> - 1 primary (in DC1)
> - 1 sync secondary (in other DC)
> - 2 async secondaries (distributed over DCs)

I'm a bit surprised that you're ok with the latency imposed by using
sync replication to another data center. I'm guessing they're pretty
close to each other?

> - General requirements are:
>   - *always* have one sync secondary online (no exceptions)

Well, you kind of have to, or everything stops. ;)

> The "naive" idea was to shut down all instances (starting with the
> primary to enable final syncs), run "pg_upgrade -k" on both the former
> primary and the former sync secondary, re-link the recovery.conf on the
> secondary, re-enable the "primary" IP and start both.
>
> D'oh! The secondary is complaining about a different cluster identifier:
> "FATAL: database system identifier differs between the primary and standby"

No, you can't do that.

> (From looking at the code, I could not determine straight away when and
> how this identifier is generated, but I guess it is somewhere in
> initdb.c?)

Yes.

> So, as we can't rsync (no ssh...), which would probably finish in a few
> seconds, a pg_basebackup is due. Which can be a PITA when the DB size
> is scraping on a TB and you have a single 1GB/sec connection. Bye, bye,
> availability (remember the primary requirement?).

The rsync *might* finish quickly, but that depends a lot on the
specifics of your environment. For example, the rsync method doesn't do
anything for unlogged tables, so if you have large unlogged tables they
can end up getting copied over, and that can take a long time. Some prep
work should be done to make sure you nuke any unlogged tables before you
go through with the process (or do something similar).
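For that prep work, a query along these lines (just a sketch, run
against the primary before the shutdown) will show which unlogged tables
would otherwise get dragged through the copy:

```sql
-- Sketch: list unlogged tables and their sizes so you can decide what
-- to truncate before the rsync-based upgrade procedure.
-- relpersistence = 'u' marks unlogged relations in pg_class.
SELECT n.nspname, c.relname,
       pg_size_pretty(pg_table_size(c.oid)) AS size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relpersistence = 'u'
  AND c.relkind = 'r'
ORDER BY pg_table_size(c.oid) DESC;
```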
pg_basebackup has the unfortunate issue that it's single-threaded,
meaning that enabling compression will probably cause the system to
bottleneck on a single CPU before reaching your 1Gb/s bandwidth limit
anyway. You could parallelize the backup/restore using pgbackrest or, in
recent versions I think, with barman, and that should at least let you
fill the 1Gb/s pipe with compressed data for the backup. You're likely
still looking at an hour or more, though, to get all that data copied
over that small a pipe.

> ==> So by now, we're only pg_upgrade'ing the primary and follow up
> with a pg_basebackup to the secondary, planning for much longer
> downtimes. <==

I have to say that I would probably argue that you should really have at
least two replicas in the same DC as the primary and then use
quorum-based synchronous replication. Presumably, that'd also increase
the bandwidth available to you for rebuilding the replica, reducing the
downtime associated with that. It might also get you to the point where
you could use the rsync method discussed in the pg_upgrade docs to get
the replicas back online.

> After finishing the pg_basebackup, re-link the recovery.conf, start.
> The recovery finds history files from higher timelines in the archive,
> starts to recover those (?) and then complains that the timeline
> doesn't match (don't take the numbers here too seriously, this is from
> a low-traffic test system, the fork off TL 1 was at least a year ago):
>
> restored log file "00000002.history" from archive
> restored log file "00000003.history" from archive
> restored log file "00000004.history" from archive
> FATAL: requested timeline 3 is not a child of this server's history
> DETAIL: Latest checkpoint is at 9C/36044D28 on timeline 1, but in the
> history of the requested timeline, the server forked off from that
> timeline at 69/88000000.
>
> This mess can probably be cleaned up manually (delete the
> 000000[234].history etc.
> on both the secondary and the BARMAN archive), however to be 100% safe
> (or when you're inexperienced), you take another basebackup :/

Whoahhhh.... No, this isn't good. Once you've done a pg_upgrade, you're
on a *new* cluster, really. There's no playing forward between an old PG
server and a new one that's been pg_upgrade'd, and you should really be
using a tool that makes sure you can't end up with a messed-up archive
like that.

What seems to be happening here is that your restore command is trying
to pull from the *old* server's WAL and history files, and that's
*wrong*. pgbackrest has a way to handle this and keep the stanza name
the same by using a 'stanza-upgrade', but in no case should a restore
command be pulling WAL files (of any sort) from the archive of a server
with a different system identifier. pgbackrest won't let that happen.

> And - after moving the *.history files out of the way in the archive -
> the secondary finally starts and starts receiving WALs.

That's really grotty. :(

> Sidenote: this second problem happened to us as well after a colleague
> promoted a secondary for some R/W tests w/out first disabling the
> archive_command in the morning and then re-built it using
> pg_basebackup in the afternoon.
> I have to say that it took me quite a while to figure out what was
> going on there... sudden timeline jumps, presumably out of nowhere?!?

Yikes. Your archive command shouldn't be letting you archive WAL files
from a system with a different identifier to the same WAL archive
either. If you've promoted a replica with the same system ID to be a
primary and started logging on it, well, that's valid and can definitely
cause an issue for you if you have the old primary still running, so
you'll want to be careful to try and avoid having that happen.

> Now, questions/wishes:
> - did I do something obvious(ly) wrong?

Think I've noted a few things..
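To illustrate that kind of guard (this is *not* pgbackrest's actual
implementation -- every path and file name below is hypothetical), an
archive_command wrapper could refuse WAL from a cluster whose system
identifier doesn't match the one the archive was created for:

```shell
#!/bin/sh
# Sketch: refuse to archive WAL for a cluster whose system identifier
# doesn't match the one recorded for the archive. Hypothetical paths.

# Pure comparison helper: prints "ok" when the identifiers match,
# "mismatch" otherwise (including when the expected value is missing).
sysid_matches() {
    expected="$1"
    actual="$2"
    [ -n "$expected" ] && [ "$expected" = "$actual" ] \
        && echo ok || echo mismatch
}

# In a real wrapper you'd obtain the running cluster's identifier with
# something like:
#   actual=$(pg_controldata "$PGDATA" | awk '/system identifier/ {print $NF}')
# compare it against an identifier stored alongside the archive, e.g.:
#   expected=$(cat /var/lib/pgarchive/EXPECTED_SYSID)
# and only copy the segment when they agree:
#   [ "$(sysid_matches "$expected" "$actual")" = ok ] \
#       && cp "$1" /var/lib/pgarchive/"$2"
```

Purpose-built tools do this bookkeeping for you; the point of the sketch
is only that the archive should know which system identifier it belongs
to and reject everything else.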
> - why does a recovery, based on a recovery.conf that points to a
> reachable primary (which obviously communicates its own timeline),
> still look for higher timelines' history files in the archive and try
> to jump onto these timelines? This doesn't seem reasonable to me at
> all...

PG is going to start from the current timeline and try to find all the
timelines that it could possibly play forward to, and at what point the
timeline changes were done, and then it's going to figure out which
timeline to go to (by default we try to stick with the current timeline,
but you can configure recovery.conf to specify a different timeline or
'latest'), and then it's going to request the WAL to get from where PG
is to the end of whichever timeline it thinks you want. That's all
entirely reasonable and how things are supposed to work.

Only once PG reaches the end of what's available through the restore
command does it start trying to talk to the primary. There's been some
discussion about how it might be nice to be able to configure PG to
prefer going to the primary instead, though, really, it should typically
be faster to replay WAL from a restore_command than to get it from the
primary over the network, not to mention that getting it from the
primary will introduce some additional load on the system.

Ultimately though, all of this discussion is a complete side-bar and
none of it is really the problem you were having, which, as I outlined
above, is that you had a newly pg_upgrade'd PG server trying to get WAL
from a WAL archive that was associated with the *old* server, which had
a different system ID, and that's just not ok; you must not allow that
to happen. I'm guessing that with barman you'd need to configure it to
have a new WAL archive when you do the pg_upgrade. I seriously hope that
it didn't let you archive WAL overtop of existing WAL for the *old*
database in its WAL archive; that'd be quite bad.
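For reference, the timeline-following behaviour is driven by
recovery.conf; a minimal sketch (pre-PG 12 syntax assumed, and the
hostname, user, and barman server name are made up):

```
# recovery.conf (PG 11 and earlier) -- illustrative values only
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=replicator'
restore_command = 'barman-wal-restore backup-host main %f %p'

# By default recovery sticks to the standby's current timeline; set
# 'latest' to follow timeline switches found in the archive, or a
# specific number to target exactly that timeline.
recovery_target_timeline = 'latest'
```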
> - is there a way to have pg_upgrade/initdb use a particular "database
> system identifier" or some kind of "IV" for the new instance, allowing
> for identical upgrades on primary and secondary? Apart from that
> number (and the system tables' OIDs), the data directories should be
> identical, as far as I understand pg_upgrade's functionality?

No, this is wrong on a number of levels, really.. More than just OIDs
change during a pg_upgrade, and you can't *and shouldn't* have the same
database system identifier for what are really two completely different
database clusters after the pg_upgrade happens.

> (- and where do I find that identifier?)

It's in pg_control; you can view it with the pg_controldata command.

> - is there a way to have pg_upgrade/initdb initiate the new cluster on
> the same (or a higher) timeline, to prevent f***ing up the archive? If
> not, I'd certainly appreciate such an option! (I'm well aware that I
> need a new basebackup after the upgrade anyway, but alas!)

You can't play forward across those timelines though; this is something
that needs to be solved with your process or with your archive command /
restore command. There's no case where you could play across a
pg_upgrade, so it doesn't make any sense to have the WAL be mixed like
what you're suggesting; if anything, that'd cause far more problems.

> - any other hints (apart from rsync or other ssh-based methods) how I
> could get around the pg_basebackup would be highly appreciated.
> Something that only clones the system tables and not the bulk of the
> (identical) data files?

Well, that's basically what the rsync method does (though it has to also
create a whole hardlink tree, because the new cluster has different
relfilenodes...).

> - is my situation (mandatory sync secondary) so unusual that nobody
> ever thought about the above? ;-)

No, people have, but they solved it using the methods that you've ruled
out...
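Incidentally, the system identifier is easy to pull out of pg_controldata
output with a one-liner; a sketch (the sample line below is fabricated --
in practice you'd pipe in `pg_controldata "$PGDATA"` instead):

```shell
# Extract the "Database system identifier" field from pg_controldata-style
# output. The sample text here is made up purely for illustration.
sample='Database system identifier:           6832701243946296777'
sysid=$(printf '%s\n' "$sample" | awk '/system identifier/ {print $NF}')
echo "$sysid"   # prints 6832701243946296777
```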
The suggestions you have above for other ways to approach this generally
aren't ones that'll actually work. Also, to be clear, with the approach
used by rsync, the entire system must be shut down (the primary, all the
replicas that'll be upgraded using that method, etc.), so it's not like
you could implement that approach using pg_basebackup, because
pg_basebackup wouldn't have anything to connect to...

Thanks!

Stephen