Re: Does PostgreSQL check database integrity at startup?

Stephen Frost <sfrost@xxxxxxxxxxx> · Thu, 28 Dec 2017 13:16:16 -0500

Greetings,

* Edson Carlos Ericksson Richter (richter@xxxxxxxxxxxxxx) wrote:
> Would be possible to include in future versions:
> 1) After start standby, standby run all WAL files until it is
> synchronized with master (current behavior)
> 3) Before getting into "accept read only queries", check if all base
> files have same size as master server (new behavior). In case
> something is different, throw an error and stop database startup?
> 4) Then start "accept read only queries" (current behavior)

I'm afraid it wouldn't be nearly that simple with a running system.  If
you're that concerned about it, you could shut down the primary, verify
that the replica has fully caught up to where the primary was, and then
compare the file sizes on disk using a script yourself (excluding the
temporary files and unlogged relations, etc, of course).  I'm not
completely sure that there isn't a valid reason for them to be different
even with everything shut down and fully caught up, but that seems like
the best chance to match things up.
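
A rough sketch of what such a comparison script might look like, in Python.
It assumes you have both data directories reachable as local paths (for
example, copied or mounted after both servers were shut down), and the
function names here are made up for illustration.  It only looks at main-fork
relation files under base/, skips anything that has an "_init" fork (which is
how unlogged relations show up on disk), and ignores temporary files because
their names don't match the pattern.  Tablespaces outside of base/ are not
handled.

```python
import os
import re

# Relation files under base/<dboid>/ are named <relfilenode>[_fork][.segment].
# Temporary relations (t<backend>_<relfilenode>) do not match this pattern.
REL_FILE = re.compile(r"^(\d+)(?:_(fsm|vm|init))?(?:\.(\d+))?$")


def relation_sizes(data_dir):
    """Map 'dboid/filename' -> size in bytes for main-fork relation files."""
    base = os.path.join(data_dir, "base")
    sizes = {}
    for db in sorted(os.listdir(base)):
        db_path = os.path.join(base, db)
        if not db.isdigit() or not os.path.isdir(db_path):
            continue  # e.g. base/pgsql_tmp
        names = os.listdir(db_path)
        # A relfilenode with an _init fork belongs to an unlogged relation.
        unlogged = set()
        for name in names:
            m = REL_FILE.match(name)
            if m and m.group(2) == "init":
                unlogged.add(m.group(1))
        for name in names:
            m = REL_FILE.match(name)
            if m is None or m.group(2) is not None or m.group(1) in unlogged:
                continue  # not a main-fork file of a permanent relation
            sizes[f"{db}/{name}"] = os.path.getsize(os.path.join(db_path, name))
    return sizes


def compare_sizes(primary, replica):
    """Return (file, primary_size, replica_size) for every mismatch.

    A size of None means the file is missing on that side.
    """
    diffs = []
    for key in sorted(set(primary) | set(replica)):
        p, r = primary.get(key), replica.get(key)
        if p != r:
            diffs.append((key, p, r))
    return diffs
```

You'd call relation_sizes() once per data directory and hand both results to
compare_sizes().  A non-empty result is only a hint to go look closer, for the
reason given above: there may be valid reasons for a difference.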

On a running system where lots of processes are potentially writing to
different files across the system, getting a point-in-time snapshot of
the file sizes across the entire system as of a certain WAL point (so
you could replay to there and then compare on the replica) seems
unlikely to be possible, at least with just PG.  Perhaps with help from
the OS/filesystem, you might be able to get such a snapshot.

Do you have some way of knowing that such a check would have actually
caught this..?  Isn't it possible that the file became zero'd out
sometime after the replica started up and was running fine for a while?
One of the questions I asked earlier drove at exactly this- why
weren't there any errors thrown during WAL replay on the replica, given
that the relation clearly appeared to be modified after the replica was
built?  Another thing which could be done would be to look through the
WAL for references to that file and see what the WAL included for it.
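
As a sketch of that last idea: pg_waldump (pg_xlogdump before PG 10) prints
one line per WAL record, and records that touch a relation carry block
references of the form "rel <tablespace>/<database>/<relfilenode>".  Assuming
that output format, a small Python filter along these lines could pull out
the records for one file.  The function name and the relfilenode used in the
comment are placeholders, not anything from this thread.

```python
import re
import sys

# Block references in pg_waldump output look like "rel 1663/16384/16385".
BLKREF = re.compile(r"\brel (\d+)/(\d+)/(\d+)\b")


def records_for_relfilenode(lines, relfilenode):
    """Return the dump lines that reference the given relfilenode."""
    wanted = str(relfilenode)
    hits = []
    for line in lines:
        for _spc, _db, rel in BLKREF.findall(line):
            if rel == wanted:
                hits.append(line.rstrip("\n"))
                break  # one match is enough to keep the line
    return hits


if __name__ == "__main__" and len(sys.argv) == 2:
    # Reads pg_waldump output on stdin; argument is the relfilenode to look for.
    for hit in records_for_relfilenode(sys.stdin, sys.argv[1]):
        print(hit)
```

Piping the pg_waldump output for the relevant WAL segments into this, with
the relfilenode of the damaged file as the argument, would show what the WAL
contained for it (and whether it contained anything at all).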

Also, really, if this is happening then there's something pretty wrong
with some part of the process and that simply needs to be fixed- just
throwing an error saying the replica isn't valid isn't really good for
anyone.  Where the issue is, I don't think we have any idea, because
there are a lot of details missing here, which is why I was asking
questions about the other data and information since it might help us
figure out what happened.  Perhaps it's an issue in the OS or filesystem
or another layer (though there should really be logs if that's the case)
or maybe it really is a PG issue, but if so, we need a lot more info to
debug and address it.

Thanks!

Stephen