Re: Does PostgreSQL check database integrity at startup?

Stephen Frost <sfrost@xxxxxxxxxxx> · Thu, 28 Dec 2017 07:16:51 -0500

Alvaro,

* Alvaro Herrera (alvherre@xxxxxxxxxxxxxx) wrote:
> For context: this was first reported in the Barman forum here:
> https://groups.google.com/forum/#!msg/pgbarman/3aXWpaKWRFI/weUIZxspDAAJ
> They are using Barman for the backups.

Ahhhh, I see.  I wasn't aware of that history.

> Stephen Frost wrote:
> 
> > > But at some point in time, slave became corrupt (one of the base
> > > files are zero size where it should be 16Mb in size), and IMHO a
> > > "red alert" should arise - Slave server shall not even startup at
> > > all.
> > 
> > How do you know it should be 16Mb in size...?  That sounds like you're
> > describing a WAL file, but you should be archiving your WAL files during
> > a backup, not just using whatever is in pg_xlog/pg_wal..
> 
> It's not a WAL file -- it's a file backing a table.

Interesting.

> > > Since backups are taken from slave server, all backups are also corrupt.
> > 
> > If you aren't following the appropriate process to perform a backup
> > then, yes, you're going to end up with corrupt and useless/bad backups.
> 
> A few guys went over the backup-taking protocol upthread already.
> 
> But anyway the backup tool is a moot point.  The problem doesn't
> originate in the backup -- it originates in the standby, from where the
> backup is taken.  The file can be seen as size 0 in the standby.
> Edson's question is: why wasn't the problem detected in the standby?
> It seems a valid question to me, to which we currently don't have any
> good answer.

The last message on that thread seems pretty clear to me: the comment is
"I think this is a failure in standby build."  It's not clear what that
failure was, but I agree it doesn't appear related to the backup tool
(the comment there is "I'm using rsync"), or, really, PostgreSQL at all
(a failure during the build of the replica isn't something we're
necessarily going to pick up on..).

As discussed on this thread, zero-byte files are entirely valid to
appear in the PostgreSQL data directory.

To try and dig into what happened, I'd probably look at what forks there
are of that relation, the entry in pg_class, and try to figure out how
it is that replication isn't complaining when the file on the primary
appeared to be modified well after the last modify timestamp on the
replica.  If it's possible to replicate this into a test environment,
maybe even do a no-op update of a row of that table and see what happens
with replication.  One thing I wonder is if this table used to be
unlogged and it was later turned into a logged table but something
didn't quite happen correctly with that.  I'd also suggest looking for
other file size mismatches between the primary and the replica.
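
For that last suggestion, a rough sketch in Python of how one might
compare file sizes between two copies of a data directory.  The function
names and the idea of pointing it at two locally reachable directory
trees are assumptions made for illustration, not an existing tool.  A
size difference is only a hint: a replica that is behind on replay can
legitimately differ from the primary, and zero-byte files are valid, so
anything reported here still needs to be checked by hand.

```python
import os
import sys


def collect_sizes(root):
    """Map each file path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def find_mismatches(primary_sizes, replica_sizes):
    """Return (path, primary_size, replica_size) tuples for files whose
    sizes differ.  A size of None means the file is missing on that side."""
    mismatches = []
    for path in sorted(set(primary_sizes) | set(replica_sizes)):
        p = primary_sizes.get(path)
        r = replica_sizes.get(path)
        if p != r:
            mismatches.append((path, p, r))
    return mismatches


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: script.py <primary_dir> <replica_dir>
    primary = collect_sizes(sys.argv[1])
    replica = collect_sizes(sys.argv[2])
    for path, p, r in find_mismatches(primary, replica):
        print(f"{path}: primary={p} replica={r}")
```

Running it against the base/ subdirectory of each copy keeps pg_wal and
other directories that are expected to differ out of the comparison.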

Thanks!

Stephen