md dropping disks too early (was: Use RAID-6!)

The purpose of my RAID system is 1) to protect against hardware disk failures, i.e. a hard drive that is entirely broken and won't read at all anymore. I know that this *will* happen at some point, but it's still a fairly rare event. The chance that 2 out of 8 drives go bad *in the same week* (!) is very small.
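
To put a rough number on "very small", here is a back-of-the-envelope sketch. It assumes that drives fail independently and that each has a 3% annual failure rate; both are assumptions for illustration, not measured figures for my drives.

```python
from math import comb

def prob_at_least_k_failures(n_drives, k, p_single):
    """Binomial tail: probability that at least k of n independent drives fail."""
    return sum(
        comb(n_drives, i) * p_single**i * (1 - p_single) ** (n_drives - i)
        for i in range(k, n_drives + 1)
    )

ASSUMED_AFR = 0.03  # assumed annual failure rate per drive (illustrative only)

# Convert the annual rate into a per-week failure probability.
p_week = 1 - (1 - ASSUMED_AFR) ** (1 / 52)

# 8 drives in the array, at least 2 of them failing within the same week.
p_two_in_same_week = prob_at_least_k_failures(8, 2, p_week)

print(f"P(one given drive fails in a week)   = {p_week:.2e}")
print(f"P(>=2 of 8 drives fail in same week) = {p_two_in_same_week:.2e}")
```

Under these assumptions the result is on the order of 1 in 100,000 for any given week. The independence assumption is the weak point: a shared controller, cable or power supply makes failures correlated.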

I am also concerned about 2) bit errors and silently broken sectors, and want my RAID to detect and fix those. I am not sure that Linux md does that.

There is also a real chance that a controller or some wiring goes bad, so that many disks appear to fail at the same time. Neither RAID5 nor RAID6 will protect against that, but re-cabling should fix it without data loss, as the data on the disks is not affected.

Given that this RAID array is for my personal use, that the number of disk slots in a machine is limited, and that every drive needs power 24/7, too, RAID5 is the right choice for me.

---

BUT - and this is the main purpose of my post - Linux md causes problems by itself:

In my case, and from what I read in other posts in forums and on this mailing list, many people have the problem that Linux md simply drops a disk from the RAID5, even though there was NOT an unrecoverable hardware failure. There are many situations where this happens:

1. Upgrade (my case)
2. Disk temporarily not accessible
3. Disk has bad sectors (but the other content can still be read)

None of these should be fatal. But it seems that md marks the disk as faulty and requires a resync. There does not seem to be any way to get a disk that was once marked spare or faulty back into the array, unless I do a resync. (If somebody knows a way, please show me, see thread 'Disk wrongly marked "spare", need to force re-add it'.) Now, the resync needs to read all data from all disks and can be the event that uncovers a problem with one of the other disks. That disk is then dropped as well, again with no way to re-add, and the array is entirely lost. However, that is completely unnecessary, given that there are often only a few bad sectors, and these - while bad - are no reason to say goodbye to several TB of data.
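
Why a resync is so likely to uncover a problem on another disk can be estimated roughly. The sketch below assumes an unrecoverable read error rate of 1 per 10^14 bits (a typical spec-sheet figure, not measured on my drives) and 2 TB drives (also an assumption), with the 7 remaining drives of an 8-drive RAID5 being read in full.

```python
import math

def prob_read_error(bytes_read, ure_per_bit):
    """Probability of at least one unrecoverable read error while reading bytes_read."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way for tiny p.
    return -math.expm1(bits * math.log1p(-ure_per_bit))

ASSUMED_URE_PER_BIT = 1e-14       # assumed spec-sheet error rate (illustrative)
ASSUMED_DISK_BYTES = 2 * 10**12   # assumed 2 TB per drive (illustrative)
SURVIVING_DISKS = 7               # 8-drive RAID5 with one member dropped

p_error = prob_read_error(SURVIVING_DISKS * ASSUMED_DISK_BYTES, ASSUMED_URE_PER_BIT)
print(f"P(at least one read error during resync) = {p_error:.2f}")
```

With these assumed numbers the chance of hitting at least one bad sector during the resync is well above one half, which is why dropping a whole disk over a single read error is so dangerous.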

Essentially, by being overly cautious with the data, dropping disks too early and too abruptly, md achieves the opposite of what it was made for. It was intended to protect my data against disk problems, but md actually turns minor or even temporary problems into total data loss.

I'm not overstating, because that's the exact situation I am in right now. I have only 1 disk that's actually failing, and a RAID5, so in theory I am fine. But I see no way to safely get at my data anymore. My array is offline and I have no idea how to get it online again without risking the loss of all my data.
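
For anyone who wants to check the same state on their own machine, here is a small sketch that lists which members md has flagged on an array line of /proc/mdstat, where (F) marks a faulty member and (S) a spare. The sample line is a made-up example, not the output of my system, and `parse_mdstat_members` is just a helper name I picked.

```python
import re

# Matches member entries such as "sdb1[1]" or "sdd1[3](F)" in an mdstat array line.
MEMBER_RE = re.compile(r"(\S+?)\[(\d+)\]((?:\([A-Z]\))*)")

def parse_mdstat_members(line):
    """Return {device: set_of_flags} for one array line of /proc/mdstat."""
    members = {}
    for dev, _slot, flags in MEMBER_RE.findall(line):
        members[dev] = set(re.findall(r"\(([A-Z])\)", flags))
    return members

# Hypothetical example line, for illustration only.
sample = "md0 : active raid5 sdd1[3](F) sdc1[2](S) sdb1[1] sda1[0]"

members = parse_mdstat_members(sample)
# Keep only the members that carry a flag such as F (faulty) or S (spare).
flagged = {dev: flags for dev, flags in members.items() if flags}
print(flagged)
```

On a real system the line would come from reading /proc/mdstat instead of the hard-coded sample.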

And worst of all: the whole situation was triggered by md dropping a disk from the array that wasn't even failing, just because I upgraded. :-(

Ben
