Greetings,

* PO (gunnar.bluth@xxxxxxxxxxx) wrote:
> Consider the following scenario/setup:
> - 4 DB servers in 2 DCs
> - 1 primary (in DC1)
> - 1 sync secondary (in other DC)
> - 2 async secondaries (distributed over DCs)

I'm a bit surprised that you're ok with the latency imposed by using
sync replication to another data center. I'm guessing they're pretty
close to each other?

> - General requirements are:
>   - *always* have one sync secondary online (no exceptions)

Well, you kind of have to, or everything stops. ;)

> The "naive" idea was to shut down all instances (starting with the
> primary to enable final syncs), run "pg_upgrade -k" on both the former
> primary and the former sync secondary, re-link the recovery.conf on the
> secondary, re-enable the "primary" IP and start both.
>
> D'oh! The secondary is complaining about a different cluster identifier:
> "FATAL: database system identifier differs between the primary and standby"

No, you can't do that.

> (From looking at the code, I could not determine straight away when and
> how this identifier is generated, but I guess it is somewhere in
> initdb.c?)

Yes.

> So, as we can't rsync (no ssh...), which would probably finish in a few
> seconds, a pg_basebackup is due. Which can be a PITA when the DB size
> is scraping on a TB and you have a single 1GB/sec connection. Bye, bye,
> availability (remember the primary requirement?).

The rsync *might* finish quickly, but that depends a lot on the
specifics of your environment. For example, the rsync method doesn't do
anything for unlogged tables, so if you have large unlogged tables they
can end up getting copied over, and that can take a long time. Some prep
work should be done to make sure you nuke any unlogged tables before you
go through with the process (or do something similar).
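For that prep work, a query along these lines (just a sketch, run
against the primary before the shutdown) will show which unlogged tables
would otherwise get dragged through the copy:

```sql
-- Sketch: list unlogged tables and their sizes so you can decide what
-- to truncate before the rsync-based upgrade procedure.
-- relpersistence = 'u' marks unlogged relations in pg_class.
SELECT n.nspname, c.relname,
       pg_size_pretty(pg_table_size(c.oid)) AS size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relpersistence = 'u'
  AND c.relkind = 'r'
ORDER BY pg_table_size(c.oid) DESC;
```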
pg_basebackup has the unfortunate issue that it's single-threaded,
meaning that enabling compression will probably cause the system to
bottleneck on a single CPU before reaching your 1Gb/s bandwidth limit
anyway. You could parallelize the backup/restore using pgbackrest or, in
recent versions I think, with barman, and that should at least let you
fill the 1Gb/s pipe with compressed data for the backup. You're likely
still looking at an hour or more, though, to get all that data copied
over that small a pipe.

> ==> So by now, we're only pg_upgrade'ing the primary and follow up
> with a pg_basebackup to the secondary, planning for much longer
> downtimes. <==

I have to say that I would probably argue that you should really have at
least two replicas in the same DC as the primary and then use
quorum-based synchronous replication. Presumably, that'd also increase
the bandwidth available to you for rebuilding the replica, reducing the
downtime associated with that. It might also get you to the point where
you could use the rsync method discussed in the pg_upgrade docs to get
the replicas back online.

> After finishing the pg_basebackup, re-link the recovery.conf, start.
> The recovery finds history files from higher timelines in the archive,
> starts to recover those (?) and then complains that the timeline
> doesn't match (don't take the numbers here too seriously, this is from
> a low-traffic test system, the fork off TL 1 was at least a year ago):
>
> restored log file "00000002.history" from archive
> restored log file "00000003.history" from archive
> restored log file "00000004.history" from archive
> FATAL: requested timeline 3 is not a child of this server's history
> DETAIL: Latest checkpoint is at 9C/36044D28 on timeline 1, but in the
> history of the requested timeline, the server forked off from that
> timeline at 69/88000000.
>
> This mess can probably be cleaned up manually (delete the
> 000000[234].history etc.
> on both the secondary and the BARMAN archive), however to be 100% safe
> (or when you're inexperienced), you take another basebackup :/

Whoahhhh.... No, this isn't good. Once you've done a pg_upgrade, you're
on a *new* cluster, really. There's no playing forward between an old PG
server and a new one that's been pg_upgrade'd, and you should really be
using a tool that makes sure you can't end up with a messed-up archive
like that.

What seems to be happening here is that your restore command is trying
to pull from the *old* server's WAL and history files, and that's
*wrong*. pgbackrest has a way to handle this and keep the stanza name
the same by using a 'stanza-upgrade', but in no case should a restore
command be pulling WAL files (of any sort) from the archive of a server
with a different system identifier. pgbackrest won't let that happen.

> And - after moving the *.history files out of the way in the archive -
> the secondary finally starts and starts receiving WALs.

That's really grotty. :(

> Sidenote: this second problem happened to us as well after a colleague
> promoted a secondary for some R/W tests w/out first disabling the
> archive_command in the morning and then re-built it using
> pg_basebackup in the afternoon.
> I have to say that it took me quite a while to figure out what was
> going on there... sudden timeline jumps, presumably out of nowhere?!?

Yikes. Your archive command shouldn't be letting you archive WAL files
from a system with a different identifier to the same WAL archive
either. If you've promoted a replica with the same system ID to be a
primary and started logging on it, well, that's valid and can definitely
cause an issue for you if you have the old primary still running, so
you'll want to be careful to try and avoid having that happen.

> Now, questions/wishes:
> - did I do something obvious(ly) wrong?

Think I've noted a few things..
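To illustrate that kind of guard (this is *not* pgbackrest's actual
implementation -- every path and file name below is hypothetical), an
archive_command wrapper could refuse WAL from a cluster whose system
identifier doesn't match the one the archive was created for:

```shell
#!/bin/sh
# Sketch: refuse to archive WAL for a cluster whose system identifier
# doesn't match the one recorded for the archive. Hypothetical paths.

# Pure comparison helper: prints "ok" when the identifiers match,
# "mismatch" otherwise (including when the expected value is missing).
sysid_matches() {
    expected="$1"
    actual="$2"
    [ -n "$expected" ] && [ "$expected" = "$actual" ] \
        && echo ok || echo mismatch
}

# In a real wrapper you'd obtain the running cluster's identifier with
# something like:
#   actual=$(pg_controldata "$PGDATA" | awk '/system identifier/ {print $NF}')
# compare it against an identifier stored alongside the archive, e.g.:
#   expected=$(cat /var/lib/pgarchive/EXPECTED_SYSID)
# and only copy the segment when they agree:
#   [ "$(sysid_matches "$expected" "$actual")" = ok ] \
#       && cp "$1" /var/lib/pgarchive/"$2"
```

Purpose-built tools do this bookkeeping for you; the point of the sketch
is only that the archive should know which system identifier it belongs
to and reject everything else.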
> - why does a recovery, based on a recovery.conf that points to a
> reachable primary (which obviously communicates its own timeline),
> still look for higher timelines' history files in the archive and try
> to jump onto these timelines? This doesn't seem reasonable to me at
> all...

PG is going to start from the current timeline and try to find all the
timelines that it could possibly play forward to, and at what point the
timeline changes were done, and then it's going to figure out which
timeline to go to (by default we try to stick with the current timeline,
but you can configure recovery.conf to specify a different timeline or
'latest'), and then it's going to request the WAL to get from where PG
is to the end of whichever timeline it thinks you want. That's all
entirely reasonable and how things are supposed to work.

Only once PG reaches the end of what's available through the restore
command does it start trying to talk to the primary. There's been some
discussion about how it might be nice to be able to configure PG to
prefer going to the primary instead, though, really, it should typically
be faster to replay WAL from a restore_command than to get it from the
primary over the network, not to mention that getting it from the
primary will introduce some additional load on the system.

Ultimately though, all of this discussion is a complete side-bar and
none of it is really the problem you were having, which, as I outlined
above, is that you had a newly pg_upgrade'd PG server trying to get WAL
from a WAL archive that was associated with the *old* server, which had
a different system ID, and that's just not ok; you must not allow that
to happen. I'm guessing that with barman you'd need to configure it to
have a new WAL archive when you do the pg_upgrade. I seriously hope that
it didn't let you archive WAL overtop of existing WAL for the *old*
database in its WAL archive; that'd be quite bad.
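For reference, the timeline-following behaviour is driven by
recovery.conf; a minimal sketch (pre-PG 12 syntax assumed, and the
hostname, user, and barman server name are made up):

```
# recovery.conf (PG 11 and earlier) -- illustrative values only
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=replicator'
restore_command = 'barman-wal-restore backup-host main %f %p'

# By default recovery sticks to the standby's current timeline; set
# 'latest' to follow timeline switches found in the archive, or a
# specific number to target exactly that timeline.
recovery_target_timeline = 'latest'
```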
> - is there a way to have pg_upgrade/initdb use a particular "database
> system identifier" or some kind of "IV" for the new instance, allowing
> for identical upgrades on primary and secondary? Apart from that
> number (and the system tables' OIDs), the data directories should be
> identical, as far as I understand pg_upgrade's functionality?

No, this is wrong on a number of levels, really.. More than just OIDs
change during a pg_upgrade, and you can't *and shouldn't* have the same
database system identifier for what are really two completely different
database clusters after the pg_upgrade happens.

> (- and where do I find that identifier?)

It's in pg_control; you can view it with the pg_controldata command.

> - is there a way to have pg_upgrade/initdb initiate the new cluster on
> the same (or a higher) timeline, to prevent f***ing up the archive? If
> not, I'd certainly appreciate such an option! (I'm well aware that I
> need a new basebackup after the upgrade anyway, but alas!)

You can't play forward across those timelines though; this is something
that needs to be solved with your process or with your archive command /
restore command. There's no case where you could play across a
pg_upgrade, so it doesn't make any sense to have the WAL be mixed like
what you're suggesting; if anything, that'd cause far more problems.

> - any other hints (apart from rsync or other ssh-based methods) how I
> could get around the pg_basebackup would be highly appreciated.
> Something that only clones the system tables and not the bulk of the
> (identical) data files?

Well, that's basically what the rsync method does (though it has to also
create a whole hardlink tree, because the new cluster has different
relfilenodes...).

> - is my situation (mandatory sync secondary) so unusual that nobody
> ever thought about the above? ;-)

No, people have, but they solved it using the methods that you've ruled
out...
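Incidentally, the system identifier is easy to pull out of pg_controldata
output with a one-liner; a sketch (the sample line below is fabricated --
in practice you'd pipe in `pg_controldata "$PGDATA"` instead):

```shell
# Extract the "Database system identifier" field from pg_controldata-style
# output. The sample text here is made up purely for illustration.
sample='Database system identifier:           6832701243946296777'
sysid=$(printf '%s\n' "$sample" | awk '/system identifier/ {print $NF}')
echo "$sysid"   # prints 6832701243946296777
```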
The suggestions you have above for other ways to approach this generally
aren't ones that'll actually work. Also, to be clear, with the approach
used by rsync, the entire system must be shut down (the primary, all the
replicas that'll be upgraded using that method, etc.), so it's not like
you could implement that approach using pg_basebackup, because
pg_basebackup wouldn't have anything to connect to...

Thanks!

Stephen