Jan, all,

* Jan Wieck (jan@xxxxxxxxxx) wrote:
> On Thu, Dec 28, 2017 at 1:26 PM, Stephen Frost <sfrost@xxxxxxxxxxx> wrote:
> > * Brent Wood (pcreso@xxxxxxxxx) wrote:
> > > A tool to calculate a checksum of sorts based on the table (file)
> > > content would provide a better surety of duplication than simply
> > > checking file size - like differently vacuumed tables in each copy
> > > could have the same content but be different file sizes.
> >
> > PG has support for checksums and there are tools out there to
> > validate that the checksum is correct for all pages which have one,
> > but that wouldn't help in this case because the file is zero'd out
> > (and a zero'd out file is actually a valid file in a PG data
> > directory).
> >
> > Also, the files on the primary and the replica actually can be
> > different when looked at with a complete-file checksum due to hint
> > bits being set differently (at least, possibly other ways too).
> > That doesn't make them invalid or incorrect though.
>
> In addition to what Stephen and everyone else said, it is impossible
> to get a valid snapshot of the whole "file" on a running server
> without locking the relation and reading it through the PostgreSQL
> buffer cache. On data files such as heap and index, PostgreSQL does
> extensive write caching. Preventing data loss from write caching is a
> primary purpose of WAL. Write caching in the application (PostgreSQL
> in this case) prevents the OS from actually knowing the correct
> "logical" state of the file at any given point in time. This means
> that even an LVM snapshot will not give you consistent data files of
> a running server, because the not-yet-written changes (in shared
> buffers) waiting for a checkpoint to force them into OS buffers won't
> be visible from outside PostgreSQL.

Indeed, and just to follow that through: correcting for this in the
event of a crash is exactly what WAL exists for. Anything cached in
PostgreSQL but already committed will have been written into the WAL
and synced to disk, so when PostgreSQL restarts it'll realize that a
crash happened, go back to the last checkpoint in the WAL stream, and
replay the WAL from there to get back to consistency.

This is also how online backups of PostgreSQL work, except that PG
can't then start the replay from the *latest* checkpoint; it has to
start from the checkpoint where the backup started. And it can't just
finish at any point in the WAL stream either; it must reach
consistency (the point where the backup finished).

When doing a non-exclusive backup, the starting location is returned
by PostgreSQL to the process issuing the pg_stop_backup() call as part
of that call's result, and it must then be put into a file called
'backup_label' when PG is restored from that backup, so that PG knows
a backup was done and where to start WAL replay from. The stopping
location is written into the WAL stream when backing up from a
primary, but if you're backing up from a replica then the latest
checkpoint written into pg_control is used as the stopping point,
which means that pg_control *must* be copied last when doing a backup
from a standby.

Sadly, we don't seem to have gotten around to actually documenting
this as yet, which is really a bug that we should fix and back-patch
because it's absolutely critical. The various backup tools written by
folks who have delved into all of this deeply do handle these
requirements, thankfully, but it just goes to show how difficult this
all is to get right. :/

Thanks!

Stephen
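
P.S. Since the non-exclusive flow trips people up, here's a rough
sketch of it (an illustration only, not a complete tool: the label
'example' and the /backup/data path are made up, and it assumes 9.6 or
later with PGDATA exported). The key point is that the session which
calls pg_start_backup() has to stay connected until pg_stop_backup(),
so everything here is driven from a single psql session:

    # -A -t: unaligned, tuples-only output, so \g writes just the raw
    # labelfile text and nothing else
    psql -At <<'SQL'
    -- third argument false = non-exclusive; this session stays open
    SELECT pg_start_backup('example', false, false);
    -- copy the cluster while the backup is in progress (\! runs a
    -- shell command from inside psql)
    \! rsync -a --exclude=postmaster.pid "$PGDATA"/ /backup/data/
    -- pg_stop_backup(false) returns the stop LSN plus the contents of
    -- backup_label and tablespace_map; the labelfile column is what
    -- must end up as 'backup_label' in the restored data directory
    SELECT labelfile FROM pg_stop_backup(false) \g /backup/data/backup_label
    SQL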
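
And the standby case's ordering requirement comes down to something
like this (again just a sketch, with the same made-up paths): copy
everything else first and pg_control last, so the checkpoint it points
at can't be newer than what the rest of the copy contains:

    # file-level copy from a *standby*; the stop point comes from the
    # last checkpoint recorded in pg_control
    rsync -a --exclude=global/pg_control "$PGDATA"/ /backup/data/
    cp "$PGDATA"/global/pg_control /backup/data/global/pg_control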