On Wed, 23 Apr 2008, Maurice Hilarius wrote:
Hi all.
With much-appreciated help from Bill Davidsen and Justin Piszcz, I recently
dealt with a problem with a RAID1 set caused by a failing hard disk.
At the end, there is one question remaining, which I think is quite
important:
When one has a RAID5 or RAID6 array and a disk starts "acting up", mdadm
rapidly kicks out the offending device.
Some might say "too easily" but that is another thread.
On a RAID1 set, until the failing disk completely "packs it in", it remains
part of the array.
Why??
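(For concreteness: whether md has actually kicked a member is visible in
/proc/mdstat - an ejected device carries an "(F)" flag, and a two-disk mirror
drops from [UU] to [U_]. Below is a minimal sketch in Python that reports such
members; it assumes the usual /proc/mdstat layout and is only meant to
illustrate what "kicked out" looks like, not to be a polished tool.)

#!/usr/bin/env python3
"""Minimal sketch: report md array members the kernel has marked faulty.

Assumes the usual /proc/mdstat layout, e.g.:
  md0 : active raid1 sdb1[1](F) sda1[0]
        488254464 blocks [2/1] [U_]
A kicked member carries an "(F)" flag; a healthy two-disk mirror shows [UU].
"""
import re

def faulty_members(mdstat_path="/proc/mdstat"):
    """Return a dict mapping array name -> list of members flagged faulty."""
    faulty = {}
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:\s*(.+)$", line)
            if not m:
                continue
            array, rest = m.groups()
            # Member entries look like "sdb1[1]", with "(F)" appended when faulty.
            bad = re.findall(r"(\S+?)\[\d+\]\(F\)", rest)
            if bad:
                faulty[array] = bad
    return faulty

if __name__ == "__main__":
    flagged = faulty_members()
    if not flagged:
        print("no members marked faulty")
    for array, members in flagged.items():
        print(f"{array}: marked faulty: {', '.join(members)}")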
Some more background:
Since the issue was reported and explored I have recreated this on a test
machine.
I installed RAID1 with one known-good and one known error-prone drive.
This was easy to do, as the error-prone drive has a thermal issue:
keep it cold and there are no problems, but after 30 minutes of use in a +25C
room it starts to generate data errors.
I reproduced exactly the problem I saw before:
data errors occur, the other drive in the RAID1 set gets "infected" with the
bad data, and the file system gets corrupted.
On BOTH drives.
This is highly reproducible.
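One mitigation while this is unresolved is to eject the suspect member by hand
as soon as it starts throwing errors, so the good disk stops being "infected".
Here is a minimal sketch using mdadm's --manage / --fail / --remove; the
device names /dev/md0 and /dev/sdb1 are placeholders, and the two plain mdadm
commands work just as well on their own:

#!/usr/bin/env python3
"""Minimal sketch: manually fail and remove a suspect RAID1 member.

md will not eject the member on read errors alone here, so do it by hand.
/dev/md0 and /dev/sdb1 are placeholders; run as root with the real array
and member device.
"""
import subprocess
import sys

def eject_member(array, member):
    """Mark a member faulty, then remove it from the array."""
    # Mark the suspect device faulty so md stops using it for reads/writes.
    subprocess.run(["mdadm", "--manage", array, "--fail", member], check=True)
    # Remove the now-faulty device from the array entirely.
    subprocess.run(["mdadm", "--manage", array, "--remove", member], check=True)
    print(f"{member} failed and removed; {array} is now running degraded.")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: eject_member.py /dev/mdX /dev/sdYN (e.g. /dev/md0 /dev/sdb1)")
    eject_member(sys.argv[1], sys.argv[2])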
In summary:
1) RAID1 lacks significant protection from the effects of a data error
condition on a failing drive
2) I recommend that anyone using mdadm refrain from using RAID1 until this
issue is addressed and resolved.
Thanks again.
I can confirm this: with RAID1, only when you actually REBOOT the host will
it kick the failing drive out. With RAID5 I experienced the same thing,
except it kicks the drive out right away. We would need to wait for the
linux-raid developers to answer this one.
Justin.