Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

Laurenz Albe <laurenz.albe@xxxxxxxxxxx> · Tue, 17 Jan 2023 09:26:13 +0100

On Mon, 2023-01-16 at 14:37 +0000, HECTOR INGERTO wrote:
> > The database relies on the data being consistent when it performs crash recovery.
> > Imagine that a checkpoint is running while you take your snapshot.  The checkpoint
> > syncs a data file with a new row to disk.  Then it writes a WAL record and updates
> > the control file.  Now imagine that the table with the new row is on a different
> > file system, and your snapshot captures the WAL and the control file, but not
> > the new row (that file system was snapshotted before the checkpoint synced
> > the data file).
> > You end up with a lost row.
> > 
> > That is only one scenario.  Corruption can happen in many other ways.
>  
> Can we say then that the risk comes only from the possibility of a checkpoint running
> inside the time gap between the non-simultaneous snapshots?

Another case: a transaction COMMITs, and a slightly later transaction reads the data
and sets a hint bit.  If the snapshot of the file system with the data directory in it
is slightly later than the snapshot of the file system with "pg_wal", the COMMIT might
not be part of the snapshot, but the hint bit could be.

Then the data of that transaction, whose COMMIT was never captured, could be
visible after you recover from the snapshots.
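
A toy sketch of that ordering, in Python.  This is not how PostgreSQL stores
anything: the two dictionaries merely stand in for the two file systems, the
names (wal_fs, data_fs, hint_committed, visible_after_restore) are made up for
illustration, and the commit status is looked up in the toy WAL list rather
than in a real commit log.  It only shows how a hint bit can end up in one
snapshot while the COMMIT is missing from the other.

```python
import copy

# Two independently snapshotted "file systems" (hypothetical stand-ins).
wal_fs = {"wal": []}      # the file system holding pg_wal
data_fs = {"rows": {}}    # the file system holding the data directory

def insert(xid, key, value):
    # Log the change, then place the row in the data files.
    wal_fs["wal"].append(("INSERT", xid, key))
    data_fs["rows"][key] = {"value": value, "xmin": xid, "hint_committed": False}

def commit(xid):
    # In this toy model the COMMIT exists only in the WAL.
    wal_fs["wal"].append(("COMMIT", xid))

def read_and_set_hint(key):
    # A later reader sees that the inserting transaction committed and
    # caches that fact on the row itself (the "hint bit").
    row = data_fs["rows"][key]
    if ("COMMIT", row["xmin"]) in wal_fs["wal"]:
        row["hint_committed"] = True

def visible_after_restore(wal_snap, data_snap, key):
    # A reader on the restored copy trusts the hint bit if it is set.
    row = data_snap["rows"][key]
    return row["hint_committed"] or ("COMMIT", row["xmin"]) in wal_snap["wal"]

insert(100, "k", "new row")
wal_snapshot = copy.deepcopy(wal_fs)     # snapshot of the pg_wal file system
commit(100)                              # COMMIT lands after that snapshot
read_and_set_hint("k")                   # hint bit is set in the data files
data_snapshot = copy.deepcopy(data_fs)   # slightly later snapshot of the data

commit_captured = ("COMMIT", 100) in wal_snapshot["wal"]
hint_captured = data_snapshot["rows"]["k"]["hint_committed"]
row_visible = visible_after_restore(wal_snapshot, data_snapshot, "k")
print(commit_captured, hint_captured, row_visible)
```

In this model the restored pair of snapshots contains no COMMIT for
transaction 100, yet the row carries the hint bit and is treated as visible.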

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com