Alvaro,

* Alvaro Herrera (alvherre@xxxxxxxxxxxxxx) wrote:
> For context: this was first reported in the Barman forum here:
> https://groups.google.com/forum/#!msg/pgbarman/3aXWpaKWRFI/weUIZxspDAAJ
> They are using Barman for the backups.

Ahhhh, I see.  I wasn't aware of that history.

> Stephen Frost wrote:
>
> > > But at some point in time, slave became corrupt (one of the base
> > > files are zero size where it should be 16Mb in size), and IMHO a
> > > "red alert" should arise - Slave server shall not even startup at
> > > all.
> >
> > How do you know it should be 16Mb in size...?  That sounds like you're
> > describing a WAL file, but you should be archiving your WAL files during
> > a backup, not just using whatever is in pg_xlog/pg_wal..
>
> It's not a WAL file -- it's a file backing a table.

Interesting.

> > > Since backups are taken from slave server, all backups are also corrupt.
> >
> > If you aren't following the appropriate process to perform a backup
> > then, yes, you're going to end up with corrupt and useless/bad backups.
>
> A few guys went over the backup-taking protocol upthread already.
>
> But anyway the backup tool is a moot point.  The problem doesn't
> originate in the backup -- it originates in the standby, from where the
> backup is taken.  The file can be seen as size 0 in the standby.
> Edson's question is: why wasn't the problem detected in the standby?
> It seems a valid question to me, to which we currently we don't have any
> good answer.

The last message on that thread seems pretty clear to me: the comment is
"I think this is a failure in standby build."  It's not clear what that
failure was, but I agree it doesn't appear related to the backup tool
(the comment there is "I'm using rsync"), or, really, PostgreSQL at all
(a failure during the build of the replica isn't something we're
necessarily going to pick up on..).

As discussed on this thread, zero-byte files are entirely valid in the
PostgreSQL data directory, so an empty file on its own isn't something
the standby would treat as an error.

To try and dig into what happened, I'd probably look at what forks there
are of that relation and at its entry in pg_class, and try to figure out
how it is that replication isn't complaining when the file on the
primary appears to have been modified well after the last-modified
timestamp on the replica.

If it's possible to reproduce this in a test environment, maybe even do
a no-op update of a row of that table and see what happens with
replication.

One thing I wonder is whether this table used to be unlogged and was
later turned into a logged table, but something didn't quite happen
correctly with that.

I'd also suggest looking for other file size mismatches between the
primary and the replica.

Thanks!

Stephen
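
P.S. For the file size comparison, something along these lines might be a starting point. It is only a rough sketch, not a tested tool: it assumes copies (or mounts) of both data directories are reachable from one host, and the function names and the paths passed on the command line are made up for illustration. On a live pair, sizes can legitimately differ because of replay lag, so a mismatch is a hint to look closer, not proof of corruption. pg_relation_filepath() on the primary will tell you which file backs a given table.

```python
import os
import sys


def relation_file_sizes(root):
    """Map each file path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def size_mismatches(primary_root, replica_root):
    """Return (relative_path, primary_size, replica_size) for differing files.

    A size of None means the file does not exist on that side.
    """
    primary = relation_file_sizes(primary_root)
    replica = relation_file_sizes(replica_root)
    mismatches = []
    for rel in sorted(set(primary) | set(replica)):
        p_size = primary.get(rel)
        r_size = replica.get(rel)
        if p_size != r_size:
            mismatches.append((rel, p_size, r_size))
    return mismatches


if __name__ == "__main__":
    # Hypothetical usage: compare.py /mnt/primary/base /mnt/replica/base
    for rel, p_size, r_size in size_mismatches(sys.argv[1], sys.argv[2]):
        print(f"{rel}: primary={p_size} replica={r_size}")
```

Pointing it at the base/ subdirectories keeps pg_wal and other files that are expected to differ out of the report. Files that are empty on the replica but not on the primary show up with replica=0.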