Greetings,

* Edson Carlos Ericksson Richter (richter@xxxxxxxxxxxxxx) wrote:
> Would be possible to include in future versions:
> 1) After start standby, standby run all WAL files until it is
> synchronized with master (current behavior)
> 3) Before getting into "accept read only queries", check if all base
> files have same size as master server (new behavior). In case
> something is different, throw an error and stop database startup?
> 4) Then start "accept read only queries" (current behavior)

I'm afraid it wouldn't be nearly that simple with a running system.

If you're that concerned about it, you could shut down the primary,
verify that the replica has fully caught up to where the primary was,
and then compare the file sizes on disk using a script yourself
(excluding the temporary files and unlogged relations, etc, of course).
I'm not completely sure that there isn't a valid reason for them to be
different even with everything shut down and fully caught up, but that
seems like the best way to match things up.

On a running system where lots of processes are potentially writing to
different files across the system, getting a point-in-time snapshot of
the file sizes across the entire system as of a certain WAL point (so
you could replay to there and then compare on the replica) seems
unlikely to be possible, at least with just PG. Perhaps with help from
the OS/filesystem, you might be able to get such a snapshot.

Do you have some way of knowing that such a check would have actually
caught this..? Isn't it possible that the file became zeroed out
sometime after the replica started up and was running fine for a while?

One of the questions I asked earlier drove at exactly this: why weren't
there any errors thrown during WAL replay on the replica, given that the
relation clearly appeared to be modified after the replica was built?
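
For what it's worth, the offline comparison described above (primary
shut down, replica fully caught up) could be sketched roughly as
follows. This is only an illustration under several assumptions: both
data directories are readable from one machine, only main-fork relation
files under base/ are compared (the other forks and any tablespaces are
left out to keep it short), and the function names and the "skip"
parameter are made up for this example. Unlogged relations can't be
recognized from the file name alone, so their paths have to be passed in
by the caller.

```python
import os
import re

# Main-fork relation files are named <relfilenode>[.<segment>], all digits.
# Temporary relations carry a "t<backend>_" prefix and so never match.
RELFILE = re.compile(r"^\d+(\.\d+)?$")


def relation_sizes(data_dir):
    """Map relative path -> size in bytes for main-fork files under base/."""
    sizes = {}
    base = os.path.join(data_dir, "base")
    for root, dirs, files in os.walk(base):
        # Do not descend into temporary-file directories.
        dirs[:] = [d for d in dirs if not d.startswith("pgsql_tmp")]
        for name in files:
            if RELFILE.match(name):
                path = os.path.join(root, name)
                sizes[os.path.relpath(path, data_dir)] = os.path.getsize(path)
    return sizes


def compare_sizes(primary, replica, skip=()):
    """Return (path, primary_size, replica_size) for every mismatch.

    A size of None means the file is missing on that side. `skip` holds
    relation paths without a segment suffix (e.g. "base/16384/16385")
    that should be ignored, such as unlogged relations.
    """
    diffs = []
    for path in sorted(set(primary) | set(replica)):
        relation = re.sub(r"\.\d+$", "", path)  # drop ".N" segment suffix
        if relation in skip:
            continue
        a, b = primary.get(path), replica.get(path)
        if a != b:
            diffs.append((path, a, b))
    return diffs
```

The skip set could be built on the primary before shutting it down, for
example from pg_relation_filepath() for the relations whose
pg_class.relpersistence is 'u'. A mismatch reported by this is a reason
to look closer, not proof of corruption, for the reasons given above.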

Another thing which could be done would be to look through the WAL for
references to that file and see what the WAL included for it.

Also, really, if this is happening then there's something pretty wrong
with some part of the process and that simply needs to be fixed; just
throwing an error saying the replica isn't valid isn't really good for
anyone.

Where the issue is, I don't think we have any idea, because there are a
lot of details missing here, which is why I was asking questions about
the other data and information, since it might help us figure out what
happened. Perhaps it's an issue in the OS or filesystem or another layer
(though there should really be logs if that's the case), or maybe it
really is a PG issue, but if so, we need a lot more info to debug and
address it.

Thanks!

Stephen