Jan, all,

* Jan Wieck (jan@xxxxxxxxxx) wrote:
> On Thu, Dec 28, 2017 at 1:26 PM, Stephen Frost <sfrost@xxxxxxxxxxx> wrote:
> > * Brent Wood (pcreso@xxxxxxxxx) wrote:
> > > A tool to calculate a checksum of sorts based on the table (file)
> > > content would provide a better surety of duplication than simply
> > > checking file size - like differently vacuumed tables in each copy
> > > could have the same content but be different file sizes.
> >
> > PG has support for checksums and there are tools out there to
> > validate that the checksum is correct for all pages which have one,
> > but that wouldn't help in this case because the file is zero'd out
> > (and a zero'd out file is actually a valid file in a PG data
> > directory).
> >
> > Also, the files on the primary and the replica actually can be
> > different when looked at with a complete-file checksum due to hint
> > bits being set differently (at least, possibly other ways too).
> > That doesn't make them invalid or incorrect though.
>
> In addition to what Stephen and everyone else said, it is impossible
> to get a valid snapshot of the whole "file" on a running server
> without locking the relation and reading it through the PostgreSQL
> buffer cache. On data files such as heap and index, PostgreSQL does
> extensive write caching. Preventing data loss from write caching is a
> primary purpose of WAL. Write caching in the application (PostgreSQL
> in this case) prevents the OS from actually knowing the correct
> "logical" state of the file at any given point in time. This means
> that even an LVM snapshot will not give you consistent data files of
> a running server, because the not-yet-written changes (in shared
> buffers) waiting for a checkpoint to force them into OS buffers won't
> be visible from outside PostgreSQL.

Indeed, and just to follow that through: correcting for this in the
event of a crash is exactly what WAL exists for. Anything cached in
PostgreSQL but already committed will have been written into the WAL
and synced to disk, so when PostgreSQL restarts it'll realize that a
crash happened, go back to the last checkpoint in the WAL stream, and
replay the WAL from there to get back to consistency.

This is also how online backups of PostgreSQL work, except that PG
can't then start the replay from the *latest* checkpoint; it has to
start from the checkpoint where the backup started. And it can't just
finish at any point in the WAL stream either; it must reach
consistency (the point where the backup finished).

When doing a non-exclusive backup, the starting location is returned
by PostgreSQL to the process issuing the pg_stop_backup() call as part
of that call's result, and it must then be put into a file called
'backup_label' when PG is restored from that backup, so that PG knows
a backup was done and where to start WAL replay from. The stopping
location is written into the WAL stream when backing up from a
primary, but if you're backing up from a replica then the latest
checkpoint written into pg_control is used as the stopping point,
which means that pg_control *must* be copied last when doing a backup
from a standby.

Sadly, we don't seem to have gotten around to actually documenting
this as yet, which is really a bug that we should fix and back-patch
because it's absolutely critical. The various backup tools written by
folks who have delved into all of this deeply do handle these
requirements, thankfully, but it just goes to show how difficult this
all is to get right. :/

Thanks!

Stephen
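
P.S. Since the non-exclusive flow trips people up, here's a rough
sketch of it (an illustration only, not a complete tool: the label
'example' and the /backup/data path are made up, and it assumes 9.6 or
later with PGDATA exported). The key point is that the session which
calls pg_start_backup() has to stay connected until pg_stop_backup(),
so everything here is driven from a single psql session:

    # -A -t: unaligned, tuples-only output, so \g writes just the raw
    # labelfile text and nothing else
    psql -At <<'SQL'
    -- third argument false = non-exclusive; this session stays open
    SELECT pg_start_backup('example', false, false);
    -- copy the cluster while the backup is in progress (\! runs a
    -- shell command from inside psql)
    \! rsync -a --exclude=postmaster.pid "$PGDATA"/ /backup/data/
    -- pg_stop_backup(false) returns the stop LSN plus the contents of
    -- backup_label and tablespace_map; the labelfile column is what
    -- must end up as 'backup_label' in the restored data directory
    SELECT labelfile FROM pg_stop_backup(false) \g /backup/data/backup_label
    SQL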
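
And the standby case's ordering requirement comes down to something
like this (again just a sketch, with the same made-up paths): copy
everything else first and pg_control last, so the checkpoint it points
at can't be newer than what the rest of the copy contains:

    # file-level copy from a *standby*; the stop point comes from the
    # last checkpoint recorded in pg_control
    rsync -a --exclude=global/pg_control "$PGDATA"/ /backup/data/
    cp "$PGDATA"/global/pg_control /backup/data/global/pg_control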