Stephen Frost – Thu, 16. August 2018 19:00

> Greetings,

I salute you, Stephen! TL;DR: I blundered by not spotting an easter egg of my predecessors.

> * PO (gunnar.bluth@xxxxxxxxxxx) wrote:
> > Consider the following scenario/setup:
> > - 4 DB servers in 2 DCs
> > - 1 primary (in DC1)
> > - 1 sync secondary (in other DC)
> > - 2 async secondaries (distributed over DCs)
>
> I'm a bit surprised that you're ok with the latency imposed by using
> sync replication to another data center. I'm guessing they're pretty
> close to each other?

Yep, they are - as close as they're allowed to be to fulfil regulatory requirements. Availability (and consistency) is far more relevant than performance; the machines are at ~5% load most of the time.

> > - General requirements are:
> > - *always* have one sync secondary online (no exceptions)
>
> Well, you kind of have to or everything stops. ;)

Guess why we have 4 servers in each cluster ;-)

> > The "naive" idea was to shut down all instances (starting with the primary to
> > enable final syncs), run "pg_upgrade -k" on both the former primary and the
> > former sync secondary, re-link the recovery.conf on the secondary, re-enable
> > the "primary" IP and start both.
> >
> > D'oh! The secondary is complaining about a different cluster identifier:
> > "FATAL: database system identifier differs between the primary and standby"
>
> No, you can't do that.

Well, yeah, learned that the hard way. It does make sense after all, which is why I called the approach "naive" in the first place.

> > (From looking at the code, I could not determine straight away when and how
> > this identifier is generated, but I guess it is somewhere in initdb.c?)
>
> Yes.
>
> > So, as we can't rsync (no ssh...), which would probably finish in a few
> > seconds, a pg_basebackup is due. Which can be a PITA when the DB size is
> > scraping on a TB and you have a single 1GB/sec connection. Bye, bye,
> > availability (remember the primary requirement?).
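As a side note for anyone hitting the same FATAL: a quick way to see the mismatch up front is to compare the identifiers from pg_control on both sides before wiring up replication. A minimal sketch (the data directory paths are made up, adjust to your layout):

```shell
# Print the "Database system identifier" of a data directory.
# pg_controldata reads it straight from global/pg_control.
sysid() {
    pg_controldata "$1" | awk -F: '/Database system identifier/ { gsub(/ /, "", $2); print $2 }'
}

# Hypothetical layout: compare primary and standby before starting the standby.
# if [ "$(sysid /srv/pg/primary)" != "$(sysid /srv/pg/standby)" ]; then
#     echo "system identifiers differ - the standby must be rebuilt" >&2
# fi
```

After a pg_upgrade the identifier is generated afresh on each node, which is exactly why two independently upgraded clusters no longer match.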
>
> The rsync *might* finish quickly but it depends a lot on the specifics
> of your environment - for example, the rsync method doesn't do anything
> for unlogged tables, so if you have large unlogged tables you can end up
> with them getting copied over and that can take a long time, so some
> prep work should be done to make sure you nuke any unlogged tables
> before you go through with the process (or do something similar).

Nah, nothing sophisticated like unlogged tables here ;-)))

> pg_basebackup has the unfortunate issue that it's single-threaded,
> meaning that enabling compression probably will cause the system to
> bottleneck on the single CPU before reaching your 1Gb/s bandwidth
> limit anyway. You could parallelize the backup/restore using pgbackrest
> or, in recent versions I think, with barman, and that should at least
> get you to the point where you can fill the 1Gb/s pipe with compressed
> data for the backup. You're likely still looking at an hour or more
> though to get all that data copied over that small a pipe.

Yep, that's the pain about it...

> > ==> So by now, we're only pg_upgrade'ing the primary and following up with a
> > pg_basebackup to the secondary, planning for much longer downtimes. <==
>
> I have to say that I probably would argue that you should really have at
> least two replicas in the same DC as the primary and then use
> quorum-based synchronous replication. Presumably, that'd also increase
> the bandwidth available to you for rebuilding the replica, reducing the
> downtime associated with that. That might also get you to the point
> where you could use the rsync method that's discussed in the pg_upgrade
> docs to get the replicas back online.
>
> > After finishing the pg_basebackup, re-link the recovery.conf, start.
> > The recovery finds history-files from higher timelines in the archive,
> > starts to recover those (?)
> > and then complains that the timeline doesn't match
> > (don't take the numbers here too seriously, this is from a low-traffic test
> > system, the fork off TL 1 was at least a year ago):
> >
> > restored log file "00000002.history" from archive
> > restored log file "00000003.history" from archive
> > restored log file "00000004.history" from archive
> > FATAL: requested timeline 3 is not a child of this server's history
> > DETAIL: Latest checkpoint is at 9C/36044D28 on timeline 1, but in the
> > history of the requested timeline, the server forked off from that timeline at
> > 69/88000000.
> >
> > This mess can probably be cleaned up manually (delete the
> > 000000[234].history etc. on both the secondary and the BARMAN archive),
> > however to be 100% safe (or when you're inexperienced), you take another
> > basebackup :/
>
> Whoahhhh... No, this isn't good - once you've done a pg_upgrade, you're
> on a *new* cluster, really. There's no playing forward between an old
> PG server and a new one that's been pg_upgrade'd and you should really
> be using a tool that makes sure you can't end up with a messed-up
> archive like that. What seems to be happening here is that your restore
> command is trying to pull from the *old* server's WAL and history files
> and that's *wrong*.

Well, yeah. My mistake really - I should have moved the whole archive out of the way before/while running the initial pg_upgrade. Something that's not really catered for in Barman :/

> pgbackrest has a way to handle this and keep the stanza name the same by
> using a 'stanza-upgrade', but in no case should a restore command be
> pulling WAL files (of any sort) from the archive of a server with a
> different system identifier. pgbackrest won't let that happen.

That's quite interesting! I'll have to have a closer look at pgbackrest some time...

> > And - after moving the *.history files out of the way in the archive - the
> > secondary finally starts and starts receiving WALs.
>
> That's really grotty.
> :(

Well, it got me up and running. Obviously, if there had been any WAL segments from the former timeline 1 in the archive, it would have become a real mess ;-)))

> > Sidenote: this second problem happened to us as well after a colleague
> > promoted a secondary for some R/W tests w/out first disabling the
> > archive_command in the morning and then re-built it using pg_basebackup in the
> > afternoon.
> > I have to say that it took me quite a while to figure out what was going on
> > there... sudden timeline jumps, presumably out of nowhere?!?
>
> Yikes. Your archive command shouldn't be letting you archive WAL files
> from a system with a different identifier either to the same WAL
> archive. If you've promoted a replica with the same system ID to be a
> primary and started logging on it, well, that's valid and can definitely
> cause an issue for you if you have the old primary still running, so
> you'll want to be careful to try and avoid having that happen.

Yep, but alas, human errors happen. Forget to disable Puppet before running tests, forget to add an "exit 0" in the archive script, ...

> > Now, questions/wishes:
> > - did I do something obvious(ly) wrong?
>
> Think I've noted a few things..

And it helped - I found the real blunder (see below).

> > - why does a recovery, based on a recovery.conf that points to a reachable
> > primary (which obviously communicates its own timeline), still look for higher
> > timelines' history-files in the archive and try to jump onto these
> > timelines? This doesn't seem reasonable to me at all...
>
> PG is going to start from the current timeline and try to find all the
> timelines that it could possibly play forward to and at what point the
> timeline changes were done, and then it's going to figure out which
> timeline to go to (by default we try to stick with the current timeline,
> but you can configure recovery.conf to specify a different timeline or
> 'latest') and then it's going to request the WAL to get from where PG is
> to the end of whichever timeline it thinks you want. That's all
> entirely reasonable and how things are supposed to work.
>
> Only once PG reaches the end of what's available through the restore
> command does it start trying to talk to the primary. There's been some
> discussion about how it might be nice to be able to configure PG to
> prefer going to the primary instead, though, really, it should typically
> be faster to replay WAL from a restore_command than to get it from the
> primary over the network, not to mention that getting it from the
> primary will introduce some additional load on the system.

Fair enough - and I only just realised I've been carrying a recovery_target_timeline = 'latest' around from my predecessors. :facepalm:

> Ultimately though, all of this discussion is a complete side-bar and
> none of it is really the problem you were having, which, as I outlined
> above, is that you had a newly pg_upgrade'd PG server trying to get WAL
> from a WAL archive that was associated with the *old* server which had a
> different system ID, and that's just not ok - you must not allow that to
> happen. I'm guessing that with barman you'd need to configure it to
> have a new WAL archive when you do the pg_upgrade. I seriously hope
> that it didn't let you archive WAL on top of existing WAL for the
> *old* database in its WAL archive; that'd be quite bad.

If the old timeline-1 segments had still been there, barman would have noticed, AFAICT. But they weren't... knock on wood!
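For the record, that inherited setting lives in the standby's recovery.conf. A sketch of such a file (pre-v12 style; the connection details and restore command are placeholders, not our actual values):

```
# recovery.conf - illustrative values only
standby_mode = 'on'
primary_conninfo = 'host=primary.example port=5432 user=replicator'
restore_command = '...'               # fetches %f from the (Barman) WAL archive into %p
recovery_target_timeline = 'latest'   # follow timeline switches found in the archive;
                                      # without this, recovery sticks to the current timeline
```

With 'latest' set, the standby actively hunts for higher-numbered *.history files via the restore_command - which is exactly the behaviour I was puzzled about above.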
>
> > - is there a way to have pg_upgrade/initdb use a particular "database system
> > identifier" or some kind of "IV" for the new instance, allowing for identical
> > upgrades on primary and secondary? Apart from that number (and the system
> > tables' OIDs), the data directories should be identical, as far as I
> > understand pg_upgrade's functionality?
>
> No, this is wrong on a number of levels really.. More than just OIDs
> change during a pg_upgrade, and you can't *and shouldn't* have the same
> database system identifier for what are really two completely different
> database clusters after the pg_upgrade happens.

I wasn't really sure whether the OIDs etc. are truly "random", hence the inquiry.

> > (- and where do I find that identifier?)
>
> It's in pg_control; you can view it with the pg_controldata command.
>
> > - is there a way to have pg_upgrade/initdb initiate the new cluster on the
> > same (or a higher) timeline, to prevent f***ing up the archive? If not, I'd
> > certainly appreciate such an option! (I'm well aware that I need a new
> > basebackup after the upgrade anyway, but alas!)
>
> You can't play forward across those timelines though - this is something
> that needs to be solved with your process or with your archive command /
> restore command. There's no case where you could play across a
> pg_upgrade, so it doesn't make any sense to have the WAL be mixed like
> what you're suggesting - if anything that'd cause far more problems.

Understood!

> > - any other hints (apart from rsync or other ssh-based methods) how I could
> > get around the pg_basebackup would be highly appreciated. Something that only
> > clones the system tables and not the bulk of the (identical) data files?
>
> Well, that's basically what the rsync method does (though it has to also
> create a whole hardlink tree because the new cluster has different
> relfilenodes...).
>
> > - is my situation (mandatory sync secondary) so unusual that nobody ever
> > thought about the above?
> > ;-)
>
> No, people have, but they solved it using the methods that you've ruled
> out... The suggestions you have above for other ways to approach this
> generally aren't ones that'll actually work.

I'm a person who usually prefers technical solutions over dipping into the red-tape realm... but I should probably have the "no ssh between DB servers" rule reviewed one day ;-)

> Also, to be clear, with the approach used by rsync the entire system
> must be shut down (the primary, all the replicas that'll be upgraded
> using that method, etc.), so it's not like you could implement that
> approach using pg_basebackup, because pg_basebackup wouldn't have
> anything to connect to...

That's also understood. Anyway, since we're now on 9.6 or 10 with most clusters, I guess we'll just go the hard road a few more times and look into (pg_)logical replication for the next upgrades (we were mostly on 9.3 when this "way of suffering" started). And we'll get rid of the "recovery_target_timeline" and double-check our archive!

> Thanks!
>
> Stephen

Stephen, much appreciated! I owe you a beer or X (in Lisbon?)!

Cheers,
Nick