Re: What's the dfifference between pg_start_backup+copy+pg_stop_backup+WAL vs. pg_start_backup+pg_stop_backup+copy+WAL?

Stephen Frost <sfrost@xxxxxxxxxxx> · Fri, 23 Jul 2021 11:24:19 -0400

Greetings,

* Thorsten Schöning (tschoening@xxxxxxxxxx) wrote:
> Guten Tag Laurenz Albe,
> am Freitag, 23. Juli 2021 um 12:41 schrieben Sie:
> 
> > The only way to cope with that is to replay
> > WAL to a point in time *after* the last data file was backed up.
> > But with your proposed backup method, you have no way to determine
> > that point in time, so this is not safe at all.
> 
> I think your example is bad: If both rows are added in independent
> transactions, possibly inconsistent table data simply needs to be
> accepted. No backup mechanism will change anything on that, because
> the second row might always be inserted "too late".
> 
> If both rows are inserted in the same transaction, things depend on
> when that got committed: If it's before pg_stop_backup, it's at least
> contained in the last WAL and Postgres can recover to that point. If
> it's committed after pg_stop_backup, the change is lost regardless how
> files have been copied before.
> 
> > This is exactly the purpose of pg_stop_backup: it does not only
> > archive the WAL segment that completes the backup, but it also adds
> > a BACKUP_END record to the WAL stream.
> 
> With having pg_start_backup+pg_stop_backup that record is available.

Sure it's available, but it will be wrong because the mark in the WAL
where BACKUP_END is isn't actually the point you have to replay WAL
until to make sure your system is consistent.

> > That ensures that recovery
> > cannot stop too early: any attempt to end recovery before reaching
> > BACKUP_END will cause a fatal error.
> 
> And here's the point: File system snapshots don't have that record at
> all, but are considered safe because of the crash safety guarantees
> built into Postgres. So as long as the data directory makes somewhat
> sense, Postgres obviously is able to start and does some recovering.

Atomic filesystem snapshots work because we can figure out what the
ending point of the WAL is by simply replaying the WAL that existed at
the time of the snapshot.  Once we hit the end of the valid WAL, we know
that we're done.

> That recovering is the only difference: In case of file system
> snapshots it recovers beginning from the last CHECKPOINT, as that is
> considered safe, and applies additional WALs as available.

Yes, it's understood that everything written to the heap prior to the
latest checkpoint was sync'd, and trustworthy, and everything written
after that point is in the WAL, and we know the end of that WAL because
we read through the WAL that existed at the time and discover what the
end point was.

> With pg_start_backup+copy+pg_stop_backup, the state of the finally
> copied data directory is unknown as well. It might be the same
> checkpoint when pg_start_backup finished or might be a newer one,
> because copying data took so long or whatever. For recovery, Postgres
> can only rely on the currently available checkpoint of the data
> directory OR what is stored in the backup_label file:
> 
> > START WAL LOCATION: 428/40000060 (file 000000010000042800000040)
> > CHECKPOINT LOCATION: 428/40000098
> > BACKUP METHOD: pg_start_backup
> > BACKUP FROM: master
> > START TIME: 2021-07-23 08:05:25 CEST
> > LABEL: postgres_files_full_on_ext4
> > START TIMELINE: 1
> 
> So what does it do? From my understanding it starts from the
> checkpoint location of the file and recovers using the available WALs.
> Because that is the only way to not rely on the state of the copied
> data directory too much.
> 
> Or does it take a more current checkpoint available in the data
> directory into account? From my understanding it can't: The files
> declaring that more current checkpoint might have been copied after
> all other files have been copied already, but without the data
> necessary for the checkpoint in the end. To prevent that, Postgres
> starts in the past by using CHECKPOINT LOCATION.
> 
> And here's the point again: When it always starts from the checkpoint
> stored in backup_label, from a crash safety perspective it doesn't
> make any difference if files are copied before or after
> pg_stop_backup. The only difference would be that all data during the
> copying process itself would need to be considered lost. But that
> might easily be acceptable, backups become outdated all the time.

What matters is making sure to play all of the outstanding WAL because
until that's done, the database isn't consistent.  When it comes to
atomic filesystem snapshots, that can be done by replaying from the last
checkpoint all of the WAL in the pg_wal directory until it ends- once
you get to the end, you know that you've replayed everything and the
system is consistent.  Similar, once you reach the BACKUP_END record of
a backup you know the database is consistent- but only if all of the
files that PG was writing to were copied between the start checkpoint
of the backup and the BACKUP_END record in the WAL.

> > As soon as BACKUP_END is
> > processed, the startup process knows that the database is now
> > consistent (and logs a message to that extent).
> 
> BACKUP_END will still be processed when copying after pg_stop_backup.

Sure, but, again, it'll be wrong and therefore the database would be
opened up for reads or could even be promoted before reaching
consistency, leading to a corrupt system.  It's absolutely critical that
all of the WAL be replayed in order for the system to be consistent.

> > With your proposed method, you run the danger of stopping recovery too
> > early, with the consequence of ending up with a corrupted database.
> 
> But BACKUP_END would be available in both cases.

Doesn't matter that it's available if it's in the wrong place.

As is often brought up elsewhere, if we don't have to be correct then
things are a lot easier and can be made a lot faster. :)

Thanks,

Stephen
Attachment:
signature.asc

Description: PGP signature