Re: root cause of corruption in hot standby

Mike Broers <mbroers@xxxxxxxxx> · Wed, 10 Oct 2018 09:15:16 -0500

The replica is instantiated with a pg_basebackup, and seems to run fine for a few days before the checksum error presents itself.  Initially it ran a few months without issue.  The replica vm was created in May and it ran until September without the checksum error.  This time it was 12 days after a fresh pg_basebackup. 
I'll look into rsync checksums, but this corruption presented itself while streaming replication was working fine. The replica wasn't restoring archived, rsynced transaction logs at the time, and hadn't done so for around 30 hours.  The table it complained about is accessed every minute with updates and monitoring, so I don't think it would have taken so long to surface if it were due to the application of a corrupted WAL.

I'd like to know if there are diagnostics I can turn to in order to validate the VM and its configuration.  The usual logging in /var/log and dmesg isn't showing anything, and neither is chkdsk.

On Tue, Oct 9, 2018 at 7:56 PM Rui DeSousa <rui@xxxxxxxxxxxxx> wrote:

> On Oct 9, 2018, at 1:04 PM, Mike Broers <mbroers@xxxxxxxxx> wrote:
>
> Ok so I have checksum errors in this replica AGAIN.

Mike,

I don’t think you are dealing with a “Postgres” issue but possibly bit rot from either faulty hardware or a misconfiguration in your stack.

If you recall, the archived WAL file was originally corrupted.  Replicating the WAL files is outside the functionality of Postgres; thus it would be either a file replication issue, bit rot, or some other data corruption issue, but not a Postgres bug.

This leaves me with the following two points:

1. How was the replica instance instantiated? I would assume from your backup procedures, since building replicas from your backups helps validate them.

2. Are there currently any WAL files that are corrupt?  You can quickly check using rsync with the "--checksum" option, but don’t fix the files on the target yet; instead add "--dry-run" just to identify which files might have changed.  I would check this every day until the issue is fully resolved.

 i.e. rsync --archive --checksum --verbose --dry-run {source_wals}  {replica_wals}
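If rsync is not convenient on one side, the same read-only comparison can be sketched in Python. This is only an illustration, not a replacement for the rsync command above: the function names are made up for this example, SHA-256 is an arbitrary choice of hash (not necessarily what rsync uses internally), and it assumes both WAL directories are readable from the same host. Like the dry run, it only reports differences and changes nothing.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_wal_dirs(source_dir, replica_dir):
    """Compare regular files by name and content; never modifies anything.

    Returns (mismatched, missing): names whose content differs on the
    replica, and names present in the source but absent on the replica.
    """
    source_dir, replica_dir = Path(source_dir), Path(replica_dir)
    mismatched, missing = [], []
    for src in sorted(p for p in source_dir.iterdir() if p.is_file()):
        dst = replica_dir / src.name
        if not dst.is_file():
            missing.append(src.name)
        elif file_digest(src) != file_digest(dst):
            mismatched.append(src.name)
    return mismatched, missing
```

Calling compare_wal_dirs() with your own source and replica WAL paths once a day and logging the two lists would give the same signal as the rsync dry run: any name in the first list is a WAL that changed after it was transmitted.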

Since you’re confident that you resolved the potential rsync race condition in archiving the WAL files, we shouldn’t see any differences between WALs that have already been transmitted.  If we do find WALs that are different, then you’re dealing with data corruption on the replica and need to start looking into your stack and storage system. However, if you don’t find any corrupted WALs, then question 1 needs to be scrutinized and you really need to ensure your backups are rock solid.

I wouldn’t bother rebuilding the VM instance until the problem is identified, unless you’re moving it to an all-new hardware stack.

P.S. Is there any anti-virus software running on the server, or any other software that might modify files on your behalf?