Shortest offline window on database migration

Haroldo Kerry <hkerry@xxxxxxxxxxxxx> · Thu, 30 May 2019 12:08:04 -0300

Hello,
We are migrating our PostgreSQL 9.6.10 database (with streaming replication active) to a faster disk array.
We are using this opportunity to enable checksums, so we will have to do a full backup-restore.
The database size is about 500GB, it takes about 2h:30min for a full backup, and then about 1h to fully restore it with checksum enabled on the new array, plus 2h to recreate the replica on the old array.

Although all synthetic tests (pgbench) indicate the new disk array is faster, we will only be 100% confident once we see its performance on production, so our backup plan is using our replica database on the older array. If the new array performance is poor during production ramp up, we can switch to the replica with little impact to our customers.

Problem is the offline window for backup, restore the full database with checksum and recreate the replica is about 5h:30m.

One thing that occurred to us to shorten the offline window was restoring the database to both the master and replica in parallel (of course we would configure the replica as master do restore the database), that would shave 1h of the total time. Although this is not documented we thought that restoring the same database to identical servers would result in binary identical data files.

We tried this in lab. As this is not a kosher way to create a replica, we ran a checksum comparison of all data files, and we ended up having a lot of differences. Bummer. Both master and replica worked (no errors on logs), but we ended up insecure about this path because of the binary differences on data files.
But in principle it should work, right?
Has anyone been through this type of problem?

Regards,
Haroldo Kerry