Thanks Jeff very well explained. Thanks and have nice weekend
J From: Jeff Janes <jeff.janes@xxxxxxxxx>
On Fri, Dec 20, 2019 at 2:12 PM <soumik.bhattacharjee@xxxxxxx> wrote:
How did you determine that it had all the data? I suspect that they fell out of sync towards the end of the pg_restore, and your method just couldn't detect this fact. At least, I can't think of anything which would cause them to lose
sync exactly at the end of pg_restore. Maybe it was just that the first checkpoint after pg_restore was finished caused the necessary WAL files to be recycled. It could have just as easily been a checkpoint running during the pg_restore which caused the
problem, but by luck it was not.
I don't think you did anything objectively wrong. You could argue that using wal_keep_segments rather than a replication slot was wrong, or you could say it was not wrong but just a calculated risk. In this case, it seems the risk was
realized. Using a replication slot would also be a risk, the risk in that case being that the streaming to replica can't keep up, and so pg_wal fills up to capacity and crashes the master. You have to decide what risk you would rather take.
Usually the master is your production server. Why would you be refreshing it? Where would you be refreshing it from? What other server exists that contains a higher level of truth than what your master production server already has? You can use a replication slot, you can increase wal_keep_segments to a larger value (although there is no way to know with certainty ahead of time what value will be large enough), or you can just deal with the risk that your replica may
occasionally lose sync and need to be recreated. You might also be able to change the topology so your current replica and current master both stream from the higher-source-of-truth server, rather than cascading changes, first logically and then physically.
There is no correct answer, you have to understand and weigh the balance of risks for yourself. Cheers, Jeff |