Greetings, * Thorsten Schöning (tschoening@xxxxxxxxxx) wrote: > Guten Tag Stephen Frost, > am Freitag, 23. Juli 2021 um 17:24 schrieben Sie: > > > [...]Similar, once you reach the BACKUP_END record of > > a backup you know the database is consistent- but only if all of the > > files that PG was writing to were copied between the start checkpoint > > of the backup and the BACKUP_END record in the WAL. > > So I am correct that the recovery process starts at what is documented > as CHECKPOINT LOCATION in the file backup_label? If the data directory > looks like a more current checkpoint might be available, that is NOT > used? Instead, all available WALs are replayed? Yes, the recovery has to start from the checkpoint in the backup label. The fact that pg_control might have a more recent checkpoint is exactly the problem- any files which were copied before the checkpoint in pg_control wouldn't have the correct contents and wouldn't be fixed by the recovery process. Furthermore, if PG were to stop replaying before getting to the correct backup-end location then the database might not be consistent at that point because some records had been written into the WAL as committed but hadn't made it to the heap yet. There's risk on both sides. > But how come the state of other data files into the recovery process? > that's what Laurenz Albe's example is about: A data file which is more > current than expected because that file is only copied, but the > associated changes are stored in WALs after pg_stop_backup and > therefore need to be considered lost. I'm not sure that you're quite understanding the example because we always write to WAL first and sync that before writing to the heap, so there isn't a case of a "more recent" file which is somehow ahead of what's in the WAL. > Why is the data file with that additional row in B used at all? The > additional row is associated with a transaction number not available > when restoring based on CHECKPOINT LOCATION + WALs of pg_stop_backup. The data file is copied because it's part of the database..? If you didn't copy it, then you'd be missing a lot of data. As for the question about if that row would be visible, it certainly might be- but it depends on when the transaction log files were copied, or if perhaps the tuple ended up being frozen somehow. > So why is that row used at all instead of simply ignored? That would > actually revert the cluster to the state of pg_stop_backup, but as > said, backup outdate anyway. I still don't get why the result is not > as consistent as e.g. a crash at the same point in time instead of > doing a backup. How would you know to ignore it if that tuple was frozen? A crashed PG cluster isn't consistent until WAL replay from the last checkpoint to the ending point of WAL has completed. That's exactly the same with a backup, but the checkpoint you have to start from is the one stored in the backup label and the point in the WAL that has to be reached for consistency is the one that's at the backup-end location in the WAL. Otherwise they're both the same. The reason you have to start from the start-backup checkpoint is exactly because the data files are being copied over a period of time and not all at once- unlike with a crash where everything stops at the same time (and if that doesn't happen, you definitely do have the risk of having corruption from such a partial-crash, though we've tried our best to detect such cases and deal with them, but we need the kernel to actually tell us when there's a problem and that's been something of an interesting challenge..). > Mit freundlichen Grüßen > > Thorsten Schöning Seriously, when your signature is twice the size of your email, it's too much. Please cut it down when posting to these lists, we certainly don't need hundreds of copied of it in our archives or everyone's mail boxes. Thanks, Stephen
Attachment:
signature.asc
Description: PGP signature