error handling for cluster/replicate across 2 or more nodes

On 7/10/10, John Preston <byhisdeeds at gmail.com> wrote:
> I ask this because in a previous installation where I was running 1TB of
> RAID1 (2 disks) under CentOS 5, I noticed that after a few days the RAID
> disk would go into read-only mode. When I unmounted it and ran a disk check

They don't happen to be Seagate 1TB drives, do they? Last year I had to
spend a weekend rebuilding an LVM on 4 of them (2x RAID 1).


> (fsck) it reported a whole bunch of errors about shared inodes, etc., all of
> which it fixed. However, after putting the disk back online the same problem
> occurred a couple of days later. I checked the RAID disks independently for
> bad blocks, etc., and one of them was reporting some errors, which were fixed.
> In the end I had to replace the disk that kept reporting errors, even though
> I was told that they had been fixed. When I did this, I no longer had the
> problem.
>
> So you can see why I'm trying to understand how gluster might deal with such
> a case, and whether errors occurring on one of the replicated nodes could find
> their way into the other nodes and corrupt them.

This would be bad if bad sectors really can corrupt the good copies despite
the drives reporting hard errors. Unfortunately it's the weekend, so replies
tend to be slow on mailing lists in general. Hopefully somebody with in-depth
knowledge can chip in on Monday.
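
For what it's worth, the kind of setup being discussed is a cluster/replicate
(AFR) translator layered over two protocol/client subvolumes. Below is a rough
client-side volfile sketch; the host names, brick name and volume names are
placeholders I've made up, not anything from the original post, and the exact
options vary by GlusterFS version.

# Client-side volfile sketch for a two-node mirror. Host names and
# brick names below are placeholders, not from the original post.

volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host server1.example.com
  option remote-subvolume brick1
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host server2.example.com
  option remote-subvolume brick1
end-volume

# cluster/replicate keeps the two subvolumes as mirrors of each other;
# its self-heal mechanism is what repairs a copy it detects as stale.
volume mirror
  type cluster/replicate
  subvolumes remote1 remote2
end-volume

With this layout every write goes to both remote1 and remote2, so the question
in the original post really comes down to how self-heal decides which copy is
authoritative when one brick sits on a disk that is silently returning bad data.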

