Greetings, * Thorsten Schöning (tschoening@xxxxxxxxxx) wrote: > Guten Tag Laurenz Albe, > am Freitag, 23. Juli 2021 um 12:41 schrieben Sie: > > > The only way to cope with that is to replay > > WAL to a point in time *after* the last data file was backed up. > > But with your proposed backup method, you have no way to determine > > that point in time, so this is not safe at all. > > I think your example is bad: If both rows are added in independent > transactions, possibly inconsistent table data simply needs to be > accepted. No backup mechanism will change anything on that, because > the second row might always be inserted "too late". > > If both rows are inserted in the same transaction, things depend on > when that got committed: If it's before pg_stop_backup, it's at least > contained in the last WAL and Postgres can recover to that point. If > it's committed after pg_stop_backup, the change is lost regardless how > files have been copied before. > > > This is exactly the purpose of pg_stop_backup: it does not only > > archive the WAL segment that completes the backup, but it also adds > > a BACKUP_END record to the WAL stream. > > With having pg_start_backup+pg_stop_backup that record is available. Sure it's available, but it will be wrong because the mark in the WAL where BACKUP_END is isn't actually the point you have to replay WAL until to make sure your system is consistent. > > That ensures that recovery > > cannot stop too early: any attempt to end recovery before reaching > > BACKUP_END will cause a fatal error. > > And here's the point: File system snapshots don't have that record at > all, but are considered safe because of the crash safety guarantees > built into Postgres. So as long as the data directory makes somewhat > sense, Postgres obviously is able to start and does some recovering. Atomic filesystem snapshots work because we can figure out what the ending point of the WAL is by simply replaying the WAL that existed at the time of the snapshot. Once we hit the end of the valid WAL, we know that we're done. > That recovering is the only difference: In case of file system > snapshots it recovers beginning from the last CHECKPOINT, as that is > considered safe, and applies additional WALs as available. Yes, it's understood that everything written to the heap prior to the latest checkpoint was sync'd, and trustworthy, and everything written after that point is in the WAL, and we know the end of that WAL because we read through the WAL that existed at the time and discover what the end point was. > With pg_start_backup+copy+pg_stop_backup, the state of the finally > copied data directory is unknown as well. It might be the same > checkpoint when pg_start_backup finished or might be a newer one, > because copying data took so long or whatever. For recovery, Postgres > can only rely on the currently available checkpoint of the data > directory OR what is stored in the backup_label file: > > > START WAL LOCATION: 428/40000060 (file 000000010000042800000040) > > CHECKPOINT LOCATION: 428/40000098 > > BACKUP METHOD: pg_start_backup > > BACKUP FROM: master > > START TIME: 2021-07-23 08:05:25 CEST > > LABEL: postgres_files_full_on_ext4 > > START TIMELINE: 1 > > So what does it do? From my understanding it starts from the > checkpoint location of the file and recovers using the available WALs. > Because that is the only way to not rely on the state of the copied > data directory too much. > > Or does it take a more current checkpoint available in the data > directory into account? From my understanding it can't: The files > declaring that more current checkpoint might have been copied after > all other files have been copied already, but without the data > necessary for the checkpoint in the end. To prevent that, Postgres > starts in the past by using CHECKPOINT LOCATION. > > And here's the point again: When it always starts from the checkpoint > stored in backup_label, from a crash safety perspective it doesn't > make any difference if files are copied before or after > pg_stop_backup. The only difference would be that all data during the > copying process itself would need to be considered lost. But that > might easily be acceptable, backups become outdated all the time. What matters is making sure to play all of the outstanding WAL because until that's done, the database isn't consistent. When it comes to atomic filesystem snapshots, that can be done by replaying from the last checkpoint all of the WAL in the pg_wal directory until it ends- once you get to the end, you know that you've replayed everything and the system is consistent. Similar, once you reach the BACKUP_END record of a backup you know the database is consistent- but only if all of the files that PG was writing to were copied between the start checkpoint of the backup and the BACKUP_END record in the WAL. > > As soon as BACKUP_END is > > processed, the startup process knows that the database is now > > consistent (and logs a message to that extent). > > BACKUP_END will still be processed when copying after pg_stop_backup. Sure, but, again, it'll be wrong and therefore the database would be opened up for reads or could even be promoted before reaching consistency, leading to a corrupt system. It's absolutely critical that all of the WAL be replayed in order for the system to be consistent. > > With your proposed method, you run the danger of stopping recovery too > > early, with the consequence of ending up with a corrupted database. > > But BACKUP_END would be available in both cases. Doesn't matter that it's available if it's in the wrong place. As is often brought up elsewhere, if we don't have to be correct then things are a lot easier and can be made a lot faster. :) Thanks, Stephen
Attachment:
signature.asc
Description: PGP signature