Neil Brown wrote on Sat, Mar 18, 2006 at 08:13:48AM +1100:
> On Friday March 17, chris@xxxxxxx wrote:
> > Dear All,
> >
> > We have a number of machines running 4TB raid5 arrays.
> > Occasionally one of these machines will lock up solid and
> > will need power cycling. Often when this happens, the
> > array will refuse to restart with 'cannot start dirty
> > degraded array'. Usually mdadm --assemble --force will
> > get the thing going again - although it will then do
> > a complete resync.

First of all, you need to make sure you can see the kernel messages
from this. If /var/log/messages lives on the affected array, you won't
see messages explaining what happened, even if the kernel printed them.

What you see here is probably similar to a problem I just had: by
using software RAID you are subject to errors below the RAID level
that are not disk errors. In my case a BIOS problem on my board made
the SATA driver run out of space on requests for two of the disks in
my RAID-5 simultaneously. The driver had to report an error upstream,
and the RAID software on top of it cannot tell such a non-disk error
from a disk error. It treats everything as a disk error and drops the
disk out of the array because it has seen errors on requests for two
disks.

I have more info on my accident here:
http://forums.2cpu.com/showthread.php?t=73705

As I said, you need to have a logfile on a disk that is not in the
array, or (better) you need to be able to watch kernel messages on the
console when this happens.

It sounds to me like you have a problem similar to the one I had: a
software error above the disks but below the RAID level.

> >
> >
> > My question is: Is there any way I can make the array
> > more robust? I don't mind it losing a single drive and
> > having to resync when we get a lockup - but having to
> > do a forced assemble always makes me nervous, and means
> > that this sort of crash has to be escalated to a senior
> > engineer.
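
On the logging point above: a minimal sketch (Python, not something
from this thread) of how one could check whether a log file sits on an
md device. It assumes the usual /proc/mounts layout of "device
mount-point fstype ..." per line and that md devices show up under
/dev/md*. The function names are made up for illustration, and a
filesystem that sits on LVM or another layer on top of md would not be
detected this way.

```python
import os


def device_for_path(path, mounts_text):
    """Return (device, mount_point) for the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts. The longest matching
    mount point wins; on a tie the later line wins, since later mounts
    shadow earlier ones.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        device, mount_point = fields[0], fields[1]
        if mount_point == "/":
            contains = True
        else:
            base = mount_point.rstrip("/")
            contains = path == base or path.startswith(base + "/")
        if contains and (best is None or len(mount_point) >= len(best[1])):
            best = (device, mount_point)
    return best


def is_on_md(path, mounts_text):
    """True if `path` is on a filesystem mounted directly from /dev/md*."""
    found = device_for_path(path, mounts_text)
    return found is not None and found[0].startswith("/dev/md")
```

To use it on a live system, read /proc/mounts and call
is_on_md("/var/log/messages", text). If that returns True, the kernel
messages about the failure are likely being written to the very array
that is failing.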

The re-sync is actually a big problem, because physically losing a
drive during the re-sync will kill your array (unless it is the
re-syncing disk).

Martin
-- 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Martin Cracauer <cracauer@xxxxxxxx>   http://www.cons.org/cracauer/
FreeBSD - where you want to go, today.      http://www.freebsd.org/
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html