On Friday March 17, chris@xxxxxxx wrote:
> Dear All,
>
> We have a number of machines running 4TB raid5 arrays.
> Occasionally one of these machines will lock up solid and
> will need power cycling. Often when this happens, the
> array will refuse to restart with 'cannot start dirty
> degraded array'. Usually mdadm --assemble --force will
> get the thing going again - although it will then do
> a complete resync.
>
> My question is: Is there any way I can make the array
> more robust? I don't mind it losing a single drive and
> having to resync when we get a lockup - but having to
> do a forced assemble always makes me nervous, and means
> that this sort of crash has to be escalated to a senior
> engineer.

Why is the array degraded? Having a crash while the array is degraded
can cause undetectable data loss. That is why md won't assemble the
array itself: you need to know there could be a problem.

But a crash with a degraded array should be fairly unusual. If it is
happening a lot, then there must be something wrong with your config:
either you are running degraded a lot (which is not safe - don't do it),
or md cannot find all the devices to assemble.

> Typical syslog:
>
> Mar 17 10:45:24 snap27 kernel: md: Autodetecting RAID arrays.
> Mar 17 10:45:24 snap27 kernel: md: autorun ...
> Mar 17 10:45:24 snap27 kernel: md: considering sdh1 ...
> Mar 17 10:45:24 snap27 kernel: md: adding sdh1 ...
> Mar 17 10:45:24 snap27 kernel: md: adding sdg1 ...
> Mar 17 10:45:24 snap27 kernel: md: adding sdf1 ...
> Mar 17 10:45:24 snap27 kernel: md: adding sde1 ...
> Mar 17 10:45:24 snap27 kernel: md: adding sdd1 ...
> Mar 17 10:45:24 snap27 kernel: md: adding sdc1 ...
> Mar 17 10:45:24 snap27 kernel: md: adding sda1 ...
> Mar 17 10:45:24 snap27 kernel: md: created md0
> Mar 17 10:45:24 snap27 kernel: md: bind<sda1>
> Mar 17 10:45:24 snap27 kernel: md: bind<sdc1>
> Mar 17 10:45:24 snap27 kernel: md: bind<sdd1>
> Mar 17 10:45:24 snap27 kernel: md: bind<sde1>
> Mar 17 10:45:24 snap27 kernel: md: bind<sdf1>
> Mar 17 10:45:24 snap27 kernel: md: bind<sdg1>
> Mar 17 10:45:24 snap27 kernel: md: bind<sdh1>
> Mar 17 10:45:24 snap27 kernel: md: running: <sdh1><sdg1><sdf1><sde1><sdd1><sdc1><sda1>
> Mar 17 10:45:24 snap27 kernel: md: md0: raid array is not clean -- starting background reconstruction
> Mar 17 10:45:24 snap27 kernel: raid5: device sdh1 operational as raid disk 4
> Mar 17 10:45:24 snap27 kernel: raid5: device sdg1 operational as raid disk 5
> Mar 17 10:45:24 snap27 kernel: raid5: device sdf1 operational as raid disk 6
> Mar 17 10:45:24 snap27 kernel: raid5: device sde1 operational as raid disk 7
> Mar 17 10:45:24 snap27 kernel: raid5: device sdd1 operational as raid disk 3
> Mar 17 10:45:24 snap27 kernel: raid5: device sdc1 operational as raid disk 2
> Mar 17 10:45:24 snap27 kernel: raid5: device sda1 operational as raid disk 0
> Mar 17 10:45:24 snap27 kernel: raid5: cannot start dirty degraded
> array for md0

So where is 'disk 1'?? Presumably it should be 'sdb1'.
Does that drive exist? Is it marked for auto-detect like the others?

NeilBrown
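
A quick way to notice you are running degraded before a crash turns it
into a forced assemble would be something like the following (md0 is
the array from the log above):

  # One underscore in the [UUUUUUU_] pattern means one missing member
  cat /proc/mdstat

  # Per-array detail, including State: (clean/degraded) and any
  # failed or removed members
  mdadm --detail /dev/md0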
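
To chase the missing member, a minimal sketch, assuming the absent
disk really is /dev/sdb as the log implies:

  # Does the kernel see the drive and its partition at all?
  grep sdb /proc/partitions

  # Is the partition type 0xfd (Linux raid autodetect), like the
  # members that were picked up?
  fdisk -l /dev/sdb

  # Does the partition carry an md superblock, and what state and
  # event count does it record?
  mdadm --examine /dev/sdb1

If the partition type is not 0xfd, the kernel's autorun will skip it
even though the drive is perfectly healthy, which matches the log:
every member except sdb1 is considered and bound.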
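
For reference, the forced assembly the poster describes would look
roughly like this - member names are taken from the log above, and the
exact list will differ per machine:

  # Stop whatever half-assembled state autorun left behind
  mdadm --stop /dev/md0

  # --force overrides the 'dirty degraded' refusal; the glob expands
  # to whichever member devices actually exist. As the poster notes,
  # a complete resync follows.
  mdadm --assemble --force /dev/md0 /dev/sd[a-h]1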