Greetings,

* bricklen (bricklen@xxxxxxxxx) wrote:
> On Tue, Aug 14, 2018 at 10:45 AM Edmundo Robles <edmundo@xxxxxxxxxxxx>
> wrote:
> > Is safe to upgrade from pg 9.3 to pg 10 directly using pg_upgrade or
> > is better upgrade, with pg_upgrade, from 9.3 -> 9.4 ->9.5 -> 9.6 -> 10.
>
> Using pg_upgrade, it is definitely possible to upgrade 9.3 to 10 in one
> jump. We did this in production earlier this year for 1500 Postgres
> clusters.
> At https://bricklen.github.io/2018-03-27-Postgres-10-upgrade/ I documented
> (in excrutiating detail, perhaps) how we upgraded those clusters from 9.3
> to 10 with sub-15 minute downtime per cluster. The only real issues we ran
> into were some corrupted indexes that appeared to be related to 10.1 and
> 10.2. Upgrading to 10.3/10.4 and reindexing fixed those particular
> corrupted indexes.

Interesting write-up.  A few points:

#1: I'd strongly recommend including something in the write-up about
checking for unlogged tables.  Those can add quite a bit of time to the
rsync if they aren't truncated before the primary cluster is shut down.

#2: The issue with timelines leads me to suspect that you had a
restore_command defined and that when PG started up, it found the timeline
history files from the old cluster.  If you don't have a restore_command
set for any of these then I'm very curious what happened.  The other
possibility (though I wouldn't have expected a timeline complaint from
PG...) is that the replica wasn't fully up to date for whatever reason.

#3: There are a number of checks discussed in the upgrade documentation
around using the rsync-based method, but it doesn't seem like you did
those.  A mention of them might be good.  Note that these are particularly
important because the rsync will *not* copy over changes to the data files
except in the relatively rare case of a relfilenode's size changing (given
that most of them are 1G, that's not too likely).
The note you have about the rsync taking more time due to "if the remote
replica was fully caught up when it was shut down" isn't accurate: there is
no WAL replay that happens on the replica using this method to 'catch up',
and if WAL replay were required for this process to be correct then it
simply wouldn't ever work.

#4: pg_upgrade absolutely resets the timeline to '1', and you shouldn't
ever copy over history files from the old cluster to the new cluster.  The
new replicas will *also* be on timeline '1'.

#5: There's no such thing as a 'timeline decrement'.  The new cluster
(whether on the primary or the replica) should only ever see itself as
being on timeline '1' when starting up after the pg_upgrade and before a
promotion happens.

#6: In note 33, it's unclear what this is referring to.  There's no WAL
which would have been generated by the pg_upgrade (that process is not
WAL'd).  Perhaps some activity needed to happen on the primary before a
new restartpoint would occur on the replica, since pg_basebackup needs a
restartpoint to begin from when run against a replica.

Thanks!

Stephen
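P.S. In case it's handy for the write-up, here's a sketch of a couple of
the checks mentioned above, as queries to run via psql (illustrative only;
pg_control_checkpoint() exists in 9.6 and later, so the timeline check
applies to the upgraded v10 cluster, not the old 9.3 one):

```sql
-- For #1: list unlogged tables (relpersistence = 'u') so they can be
-- truncated before the old primary is shut down.
SELECT n.nspname, c.relname
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relpersistence = 'u' AND c.relkind = 'r';

-- For #4/#5: on the freshly upgraded cluster, before any promotion,
-- this should report timeline_id = 1.
SELECT timeline_id FROM pg_control_checkpoint();
```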