Hi,

On Tue, 2003-02-04 at 16:43, Bodrogi Viktor wrote:

> This really breaks my confidence in RAID-1 mirrors.

Why?  RAID1 is there to deal with disk errors.  Not controller errors
or memory errors.

The data on the disk is protected by a CRC, so disk reads themselves
should not fail silently --- i.e. if the disk returns bad data, you'll
get the CRC failure and RAID1 will fail over to the other disk.

For SCSI, you have cable parity; for IDE, you have CRCs again (at least
in UDMA mode), so the transfer from disk to controller is once again
protected against silent data loss.  So if something goes wrong there,
the OS is likely to hear about it, take the disk offline, and fail over
transparently to the other disk.

I've found that in the vast majority of cases where you get silent data
corruption, the corruption is occurring in system memory.  It's either
between the controller and the CPU, or between the CPU and main memory,
or it's bad memory on the motherboard or in the CPU cache.  And you just
can't protect against that, short of going for one of the massively
redundant fault-tolerant systems like Himalaya, which run multiple
instances of the CPU in lock-step and use majority voting to detect
misbehaviours.

> Would the situation get better with a four disk RAID-5?

No, RAID-5 is sometimes even more sensitive to such problems, because it
has the ability to reconstruct one disk from the contents of the others
--- and so, if one disk goes offline, then silent corruption on the
other disks can cause it to reconstruct the wrong data for the missing
disk.

Remember, there's a huge difference between silent errors, and errors
which are detected and dealt with intelligently.  RAID of all varieties
assumes that when data on disk gets corrupted, you get to hear about it
rather than silently being given bad data; and because of sector CRCs on
disk, that is usually a valid assumption.

> I prefer definitive errors than unknown failures.
> Then it gets show up as a disk error, not as random segfaults.
>
> If this phenomena is HW error, should it be logged anywhere?
> I didn't find anything in syslog...

You tell me!  It _could_ be just about anything.  If it's a sector IO
failure, it will be logged.  If it's main memory silently corrupting
data because of bad RAM, it won't be --- run memtest86 to try to locate
it.

Cheers,
 Stephen

_______________________________________________
Ext3-users@redhat.com
https://listman.redhat.com/mailman/listinfo/ext3-users
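
A minimal sketch of the RAID-5 point made above, assuming plain XOR
parity over a single stripe.  The block contents and the helper names
(xor_blocks, make_stripe, reconstruct) are made up for illustration, and
this is not meant to reflect how any real RAID driver is written.  It
only shows that a rebuild trusts whatever the surviving disks return, so
one unreported bit flip on a surviving disk ends up in the rebuilt data.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by their XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, missing):
    """Rebuild the block at index `missing` from all the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != missing]
    return xor_blocks(survivors)


# Hypothetical four-disk stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = make_stripe(data)

# Healthy case: disk 0 goes offline and is rebuilt from the other three.
good_rebuild = reconstruct(stripe, missing=0)

# Silent corruption: one bit flips on disk 1 and no error is reported.
corrupted = list(stripe)
corrupted[1] = bytes([stripe[1][0] ^ 0x01]) + stripe[1][1:]
bad_rebuild = reconstruct(corrupted, missing=0)

print(good_rebuild == data[0])  # True: rebuild matches the lost block
print(bad_rebuild == data[0])   # False: the flipped bit leaked into the rebuild
```

Nothing in the second rebuild raises an error, which is the whole
problem: the array has no way to tell that the data it just
reconstructed is wrong.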