The purpose of my RAID system is 1) to protect against hardware disk
failures, i.e. a hard drive that is entirely broken and won't read at all
anymore. I know that this *will* happen at some point, but it's still a
fairly rare event. The chance that 2 out of 8 drives go bad *in the same
week* (!) is very small.
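
To put a rough number on "very small", here is a back-of-the-envelope
sketch. The 3% annual failure rate per drive is purely an assumption for
illustration, not something I measured, and the calculation treats the
drives as failing independently of each other, which a shared controller
or cabling problem would violate.

```python
from math import comb


def prob_at_least_k_failures(n_drives, p_single, k):
    """Probability that at least k of n independent drives fail in one window."""
    return sum(
        comb(n_drives, i) * p_single**i * (1 - p_single) ** (n_drives - i)
        for i in range(k, n_drives + 1)
    )


# ASSUMPTION: 3% annual failure rate per drive (illustrative, not measured).
ASSUMED_AFR = 0.03

# Convert the annual rate into a per-drive failure probability for one week.
p_week = 1 - (1 - ASSUMED_AFR) ** (1 / 52)

# 8 drives, at least 2 of them failing within the same week.
p_double = prob_at_least_k_failures(8, p_week, 2)
print(f"P(>=2 of 8 drives fail in the same week) ~ {p_double:.2e}")
```

Under these assumptions the result is on the order of 1 in 100,000 for
any given week.
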
I am also concerned about 2) bit errors and silently broken sectors, and
want my RAID to detect and fix those. I am not sure that Linux md does that.
There is also a good chance that a controller or some wiring is bad, and many
disks appear to fail at the same time. Neither RAID5 nor RAID6 will protect
against that, but a re-cabling should fix it without data loss, as the
data on the disks is not affected.
Given that this RAID array is for my personal use, that the number of
disk slots in a machine is limited, and that drives need 24/7 power, too,
RAID5 is the right choice for me.
---
BUT - and this is the main purpose of my post - Linux md causes problems
by itself:
In my case, and from what I read in other posts in forums and on this
mailing list, many people have the problem that Linux md simply drops a
disk from the RAID5, even though there was NOT an unrecoverable hardware
failure. There are many situations where this happens:
1. Upgrade (my case)
2. Disk temporarily not accessible
3. Disk has bad sectors (but the other content can still be read)
None of these should be fatal. But it seems that md marks the disk as
faulty and requires a resync. There does not seem to be any way to get a
disk that was once marked spare or faulty back into the array, unless I
do a resync. (If somebody knows a way, please show me, see thread 'Disk
wrongly marked "spare", need to force re-add it'.) Now, the resync needs
to read all data from all disks and can be the event that uncovers a
problem with one of the other disks. That disk is then dropped as well,
again with no way to re-add, and the array is entirely lost. However,
that is completely unnecessary, given that there are often only a few
bad sectors, and these - while bad - are no reason to say goodbye to
several TB of data.
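
How likely is it that a full resync trips over a bad sector on one of the
remaining disks? A rough sketch follows. The error rate of one
unrecoverable read error per 1e14 bits and the 1 TB drive size are both
assumptions chosen for illustration (real drives may do better or worse
than such a figure), and the function name is made up for this example.

```python
import math


def prob_read_error(bytes_read, ure_per_bit):
    """Probability of at least one unrecoverable read error over bytes_read."""
    bits = bytes_read * 8
    # 1 - (1 - r)^bits, written with log1p/expm1 to stay accurate for tiny r.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# ASSUMPTIONS, not measurements:
ASSUMED_URE_RATE = 1e-14    # one unrecoverable read error per 1e14 bits read
ASSUMED_DISK_BYTES = 1e12   # 1 TB per drive
SURVIVING_DISKS = 7         # 8-drive RAID5 with one drive dropped

# A resync has to read every surviving disk from start to end.
p_resync = prob_read_error(SURVIVING_DISKS * ASSUMED_DISK_BYTES, ASSUMED_URE_RATE)
print(f"P(at least one read error during resync) ~ {p_resync:.0%}")
```

With these assumed numbers the chance comes out at roughly 40%, which is
why dropping a second disk over a single bad sector during resync is such
a big deal.
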
Essentially, by being overly cautious with the data and dropping disks
too early and too hastily, md actually achieves the opposite of what it
was made for. It was intended to protect my data against disk problems,
but md actually turns minor or even temporary problems into total data
loss.
I'm not overstating, because that's the exact situation I am in right
now. I have only 1 disk that's actually failing, and a RAID5, so in
theory I am fine. But I see no way to safely get at my data anymore. My
array is offline and I have no idea how to get it online again without
risking the loss of all data.
And worst of all: the whole situation was triggered by md dropping a disk
from the array that wasn't even failing, just because I upgraded. :-(
Ben