Re: Shortest offline window on database migration

Jeff Janes <jeff.janes@xxxxxxxxx> · Thu, 30 May 2019 21:41:50 -0400

On Thu, May 30, 2019 at 11:08 AM Haroldo Kerry <hkerry@xxxxxxxxxxxxx> wrote:
Hello,
We are migrating our PostgreSQL 9.6.10 database (with streaming replication active) to a faster disk array.
We are using this opportunity to enable checksums, so we will have to do a full backup-restore.
The database size is about 500GB, it takes about 2h:30min for a full backup, and then about 1h to fully restore it with checksum enabled on the new array, plus 2h to recreate the replica on the old array.

As others have noticed, your "trick" won't work.  So back to basics.  Are you using the best degree of parallelization on each one of these tasks?  What is the bottleneck of each one (CPU, disk, network)? how are you creating the replica?  Can you share the actual command lines for each one?  It seems odd that the dump (which only needs to dump the index and constraint definitions) is so much slower than the restore (which actually needs to build those indexes and validate the constraints). Is that because the dump is happening from the old slow disk and restore a new fast ones?  Same with creating the replica, why is that slower than actually doing the restore?

It sounds like you are planning on blowing away the old master server on the old array as soon as the upgrade is complete, so you can re-use that space to build the new replica?  That doesn't seem very safe to me--what if during the rebuilding of the replica you run into a major problem and have to roll the whole thing back?  What will the old array which is holding the current replica server be doing in all of this?

Cheers,

Jeff