Re: root cause of corruption in hot standby

> On Oct 10, 2018, at 10:15 AM, Mike Broers <mbroers@xxxxxxxxx> wrote:
> 
> 
> I'll look into rsync checksums, but this corruption presented itself during a time when streaming replication was working fine and it wasn't restoring archived rsynced transaction logs, and hadn't done so for around 30 hours.  The table it complained about is accessed every minute with updates and monitoring, so I don't think it would have taken so long if it was due to the application of a corrupted WAL. 
> 

I think you missed my point.  If you are dealing with bit rot or data corruption on your storage device, you need to prove it, and that is very difficult to do.

You have WAL files on the primary and the same WAL files on the replica via your rsync copy job.  These WAL files are static, so if you check and recheck all of them daily (for example, by comparing checksums) and find that any file has changed, that proves some sort of corruption or bit rot is occurring.
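
To make the daily recheck concrete, here is a rough sketch in Python, not a tested tool. It assumes the rsynced WAL files sit in a single directory and that a JSON manifest of SHA-256 digests is kept somewhere outside that directory; the names scan, changed_files and check are made up for illustration. Point it at the archive copy rather than the live WAL directory, since Postgres recycles segments there.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def scan(wal_dir):
    """Map each regular file name in wal_dir to its current digest."""
    return {p.name: sha256_of(p)
            for p in sorted(Path(wal_dir).iterdir()) if p.is_file()}


def changed_files(previous, current):
    """Names present in both scans whose digest differs."""
    return sorted(name for name, digest in current.items()
                  if name in previous and previous[name] != digest)


def check(wal_dir, manifest_path):
    """Compare wal_dir against the stored manifest and return changed names.

    The manifest keeps the FIRST digest recorded for each file, so a file
    that changed once keeps being reported on later runs.
    """
    manifest = Path(manifest_path)
    previous = json.loads(manifest.read_text()) if manifest.exists() else {}
    current = scan(wal_dir)
    changed = changed_files(previous, current)
    merged = {**current, **previous}  # previous entries win over new digests
    manifest.write_text(json.dumps(merged, indent=2))
    return changed
```

Run check() once a day on the primary's archive and on the replica's copy. A non-empty result means a supposedly static file changed on that host. You can also diff the two manifests against each other to catch a file that was already damaged in transit.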

I’ve seen this type of corruption before with overtaxed RAID controllers, which would periodically corrupt data over time.  I ended up changing from a RAID configuration to JBOD and managing the disks with ZFS instead, and I never again experienced data corruption on the exact same hardware.  ZFS also detects bit rot and corrects for it, and it can scrub the pool to ensure the disks are not slowly rotting away.

What storage system is being used? Does it have any measures to prevent bit rot? What is the RAID configuration? I would not recommend RAID 5 for a database; under heavy load, the performance degradation and increased likelihood of data corruption are not worth it.

It sounds like you have some sort of environmental issue that is corrupting your data, and it is not a Postgres issue.  The problem you face is that without some sort of definitive proof you’ll enter the realm of finger pointing… it’s a database issue, no, it’s a storage issue, etc.

You have two replicas. One periodically fails and the other does not; the only difference is the environment in which they operate.



