Edson, * Edson Carlos Ericksson Richter (richter@xxxxxxxxxxxxxx) wrote: > Em 27/12/2017 15:02, Stephen Frost escreveu: > >* Alvaro Herrera (alvherre@xxxxxxxxxxxxxxx) wrote: > >>Stephen Frost wrote: > >> > >>>* Edson Carlos Ericksson Richter (richter@xxxxxxxxxxxxxx) wrote: > >>>>Anyway, instead digging into rsync functionality (or bugs - I doubt, > >>>>but who knows?), I do prefer to have a script I can run to check if > >>>>there is obvious failures in standby servers. > >>>As mentioned, zero-byte files can be perfectly valid. PostgreSQL does > >>>have page-level CRCs, if you initialized your database with them (which > >>>I would strongly recommend). > >>Page-level checksums would not detect the problem being complained in > >>this thread, however. > >It's entirely unclear to me what the problem being complained about in > >this thread actually is. The complaint so far was about zero-byte > >files, but those are entirely valid, so that isn't a problem that anyone > >can solve.. > > > >Given the thread subject, if someone actually wanted to do a database > >integrity check before startup, they could use pgBackRest to perform a > >backup with a CRC-enabled database and at least verify that all of the > >checksums are valid. > > > >We could possibly look into adding some set of additional checks for > >files which can't actually be zero-byte, perhaps.. I know we have some > >other one-off checks already. > > Actually, the problem is: > > Master => Slave => Backup > > In the master server everything is fine. Alright.. > But at some point in time, slave became corrupt (one of the base > files are zero size where it should be 16Mb in size), and IMHO a > "red alert" should arise - Slave server shall not even startup at > all. How do you know it should be 16Mb in size...? That sounds like you're describing a WAL file, but you should be archiving your WAL files during a backup, not just using whatever is in pg_xlog/pg_wal.. > Since backups are taken from slave server, all backups are also corrupt. If you aren't following the appropriate process to perform a backup then, yes, you're going to end up with corrupt and useless/bad backups. Backing up from a replica has only been officially supported using the pg_start/stop_backup methods as of 9.6 and only when doing a non-exclusive backup. Note that the wiki page you're talking about (I think, anyway...) is describing *exclusive* backup, not non-exclusive, and the two are not the same. > I've detected the problem just because I've restored a backup > (excellent practice perhaps - nobody should take backups if not > testing it with the restore procedure). Yes, restoring a backup is excellent practice and something that everyone really should be doing. In my view, at least, everyone should also be using well tested backup tools instead of trying to write their own. > In slave server there is no indication that the database is corrupt > (not in logs, it starts normally and show it is applying stream > changes regularly). This sounds like what's happening is that you're ending up with PG thinking that a crash happened because backup_label is missing, which will happen if you do pg_start/stop_backup on the replica and don't make sure to take the results from pg_stop_backup and create the backup_label file before starting PostgreSQL after the restore. This isn't something that would happen if you used a backup tool that knew about how to perform a non-exclusive backup or how to properly backup using a replica instead of trying to write your own without understanding how all these pieces play together. > So that is the point: how to detect that a database is corrupt so > cluster doesn't even start... I'm not sure that the database is actually corrupt in this specific case- it sounds like everything is actually fine, but you didn't include the backup_label file when restoring and therefore PG thinks there is crash recovery happening when it should be replaying WAL from the start of the backup, but how is PG going to know that? Well, it'd know that from the backup_label file, if it was there.. Otherwise, it seems pretty difficult for us to know that we're not doing crash recovery. I'm certainly open to ideas on how to detect that, but nothing springs to mind off-hand. Again, using a backup tool instead of trying to roll your own would make this much less likely to happen. This isn't something which should happen when you're using pg_basebackup or the other PG-specific backup tools, rather it's only happening because you're trying to do your own with pg_start/stop_backup and rsync and didn't completely read the documentation on non-exclusive backups from replicas (that said, those docs could certainly use improvment...). Of course, perhaps I'm misunderstanding exactly what you're doing or what file you're referring to, but this is my best guess based on the information you've provided so far. Thanks! Stephen
Attachment:
signature.asc
Description: Digital signature