Agreed, no good threads to pull for definite answers. I've added a checksum to the rsync as a precaution. The SAN is the same between the two hosts. One is running on a VMware server, the other is on a Hyper-V host. We have a few other Postgres database hosts on that same Hyper-V infrastructure and they all check out fine as far as I can tell.
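For reference, the checksum option I mean is rsync's -c/--checksum; the paths below are placeholders, not our actual job:

```shell
# --checksum makes rsync compare full file checksums instead of the
# default mtime+size quick check, so a file whose bytes changed on disk
# (but kept its timestamp and size) still gets detected and re-copied.
# Paths and hostname here are hypothetical placeholders.
rsync -a --checksum /pg/archive/wal/ replica:/pg/archive/wal/
```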
The only thing I could guess at this point is that although the primary and replica are both running ext4 under LVM, the replica data volumes were grown online at some point because we were intending to cut over into an environment with more space. That happened a while back and doesn't correspond to any of the times the checksum errors occurred in the logs, and I've started with fresh pg_basebackups twice since then, so I'm skeptical that's the cause, but it's all I've got.
There may also be some Hyper-V fsync/write-through cache configuration I'm unaware of. That is my fear at this point.
Thanks for following up again.
On Wed, Oct 10, 2018 at 10:22 AM Rui DeSousa <rui@xxxxxxxxxxxxx> wrote:
> On Oct 10, 2018, at 10:15 AM, Mike Broers <mbroers@xxxxxxxxx> wrote:
>
>
> I'll look into rsync checksums, but this corruption presented itself during a time when streaming replication was working fine and the replica wasn't restoring archived rsynced transaction logs, and hadn't done so for around 30 hours. The table it complained about is accessed every minute with updates and monitoring, so I don't think it would have taken that long to surface if it were due to the application of a corrupted WAL.
>
I think you missed my point. If you are dealing with some sort of bit rot and/or data corruption on your storage device, you need to prove it somehow, which is very difficult to do.
You have WAL files on the primary and the same WAL files on the replica via your rsync copy job. If you check and recheck all the WALs daily and find that any of the files have changed, that proves some sort of corruption/bit rot is occurring, since WAL files are static once written.
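The daily recheck described above could be scripted roughly like this; the function name and paths are my own illustration, not from this thread:

```shell
#!/bin/sh
# Sketch of a daily WAL recheck. Archived WAL segments never change,
# so any checksum drift between runs points at the storage layer,
# not at Postgres. Names and paths here are hypothetical.

check_wals() {
    wal_dir=$1
    baseline=$2   # keep the baseline file OUTSIDE wal_dir
    if [ ! -f "$baseline" ]; then
        # First run: record a checksum for every archived segment.
        (cd "$wal_dir" && sha256sum ./*) > "$baseline"
        echo "baseline recorded"
    elif (cd "$wal_dir" && sha256sum --quiet -c "$baseline"); then
        echo "all WAL segments unchanged"
    else
        echo "CHECKSUM MISMATCH: suspect storage-level corruption"
    fi
}

# Demo against a throwaway directory standing in for the WAL archive.
demo=$(mktemp -d)
base="$demo.sha256"
printf 'fake wal bytes' > "$demo/000000010000000000000001"
check_wals "$demo" "$base"   # first run records the baseline
check_wals "$demo" "$base"   # second run re-verifies the same bytes
```

In practice you would run this from cron against the real archive directory on both primary and replica, so a divergence on only one host implicates that host's storage stack.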
I’ve seen this type of corruption before with RAID controllers that are overtaxed; they would corrupt data periodically over time. I ended up changing from a RAID configuration to a JBOD, managing the disks via ZFS instead, and never again experienced data corruption on the exact same hardware. ZFS also detects bit rot and corrects for it, and can scrub the pool to ensure the disks are not slowly rotting away.
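If you do go the ZFS route, the scrub mentioned above is a one-liner; the pool name "tank" below is a placeholder:

```shell
# Kick off a scrub: ZFS reads every allocated block in the pool,
# verifies it against its stored checksum, and repairs from
# redundancy (mirror/raidz) where possible. 'tank' is a placeholder.
zpool scrub tank

# Review progress and any checksum errors found so far.
zpool status -v tank
```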
What storage system is being used? Does it have any measures to prevent bit rot? What is the RAID configuration? I would not recommend RAID 5 for a database; under heavy load, the performance degradation and increased likelihood of data corruption are not worth it.
It sounds like you have some sort of environmental issue that is corrupting your data, and it is not a Postgres issue. The problem you face is that without definitive proof you’ll enter the realm of finger-pointing… it’s a database issue; no, it’s a storage issue; etc.
You have two replicas; one periodically fails and the other does not — the only difference is the environment in which they operate.