Need help recovering RAID5 array

Stephen Muskiewicz <stephen_muskiewicz@xxxxxxx> · Fri, 5 Aug 2011 11:27:06 -0400

Hello,

I'm hoping to figure out how I can recover a RAID5 array that suddenly 
won't start after one of our servers took a power hit.
I'm fairly confident that all the individual disks of the RAID are OK 
and that I can recover my data (without having to resort to asking my 
sysadmin to fetch the backup tapes), but despite my extensive Googling 
and reviewing the list archives and mdadm manpage, so far nothing I've 
tried has worked.  Hopefully I am just missing something simple.

Background: The server is a Sun X4500 (thumper) running CentOS 5.5.  I 
have confirmed using the (Sun provided) "hd" utilities that all of the 
individual disks are online and none of the device names appear to have 
changed since before the power outage.  Two other RAID5 arrays and the 
/dev/md0 RAID1 OS mirror on the same box did come back cleanly.  (These 
have ext3 filesystems on them; the one that failed to come up is just a 
raw partition used via iSCSI, if that makes any difference.)  The array 
that didn't come back is /dev/md/51; the ones that did are /dev/md/52 
and /dev/md/53.  I have confirmed that all three device files do exist 
in /dev/md.  (/dev/md51 is also a symlink to /dev/md/51, as are 
/dev/md52 and /dev/md53 for the working arrays.)  We did quite a bit of 
testing on the box before we deployed the arrays and hadn't seen this 
problem until now; previously all of the arrays came back online as 
expected.  Of course, it has been about 7 months since the box last 
went down, but I don't think there have been any major changes since 
then.

When I boot the system (I tried this twice, including a hard power-down, 
just to be sure), I see "mdadm: No suitable drives found for /dev/md51". 
Again, the other 2 arrays come up just fine.  I have checked that the 
array is listed in /etc/mdadm.conf.

(I apologize for the lack of specific mdadm output in the details 
below.  The network people have conveniently (?) picked this weekend to 
upgrade the network in our campus building, and I am currently unable to 
access the server until they are done!)

"mdadm --detail /dev/md/51" does (as expected?) display: "mdadm: md 
device /dev/md51 does not appear to be active"

I have done an "mdadm --examine" on each of the drives in the array and 
each one shows a state of "clean" with a status of "U" (and all of the 
other drives in the sequence shown as "u").  The array name and UUID 
value look good and the "update time" appears to be about when the 
server lost power.  All the checksums read "correct" as well.  So I'm 
confident all the individual drives are there and OK.
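
A minimal sketch of how the per-drive "mdadm --examine" output could be condensed to one line per drive, so that UUIDs, event counts and states are easy to compare side by side. The function name summarize_examine is made up for illustration, and the field labels ("Array UUID", "Events", "State") are an assumption about what this mdadm version prints for 1.0 metadata; adjust them if the real output differs.

```shell
#!/bin/sh
# Condense `mdadm --examine` output (read on stdin) to a single line.
# Assumed field labels: "Array UUID", "Events", "State".
# The "Array State" line is not matched, because it does not start with "State".
summarize_examine() {
    awk -F' : ' '
        /^ *Array UUID/ { uuid = $2 }
        /^ *Events/     { events = $2 }
        /^ *State/      { state = $2 }
        END { printf "uuid=%s events=%s state=%s\n", uuid, events, state }
    '
}

# Usage (as root), over the ten member partitions named in this message:
#   for d in /dev/sd[a-j]1; do
#       printf '%s ' "$d"
#       mdadm --examine "$d" | summarize_examine
#   done
```

If every drive reports the same UUID and the same event count, that would support the reading that the members are consistent with each other.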

I do have the original mdadm command used to construct the array. 
(There are 8 active disks in the array plus 2 spares.)  I am using 
version 1.0 metadata with the -N arg to provide a name for each array.
So I used this command with the assemble option (but without the -N or 
-u options):

mdadm -A /dev/md/51 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 \
      /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1

But this just gave the "no suitable drives found" message.

I retried the mdadm command using -N <name> and -u <UUID> options but in 
both cases saw the same result.

One odd thing I noticed came up when I ran:
mdadm --detail --scan

The output *does* display all three arrays, but the array names show 
up as "ARRAY /dev/md/<arrayname>" rather than the "ARRAY 
/dev/md/NN" that I would expect (and that is in my /etc/mdadm.conf 
file).  I'm not sure whether this has anything to do with the problem. 
There are no /dev/md/<arrayname> device files or symlinks on the system.
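
Since the device names differ between the scan output and /etc/mdadm.conf, one way to compare the two is by UUID instead of by name. The sketch below assumes each ARRAY line carries a UUID= keyword; the helper name array_uuids is hypothetical, and whether the naming difference matters at all is not something this check can answer.

```shell
#!/bin/sh
# Print "<device> <uuid>" for every ARRAY line read on stdin.
# Assumes the UUID appears as a UUID=... (or uuid=...) word on the line.
array_uuids() {
    awk '$1 == "ARRAY" {
        for (i = 3; i <= NF; i++)
            if (tolower(substr($i, 1, 5)) == "uuid=")
                print $2, substr($i, 6)
    }'
}

# Usage (as root):
#   array_uuids < /etc/mdadm.conf
#   mdadm --examine --scan | array_uuids
```

A UUID that appears in both listings under different device names would point to a naming mismatch, not a missing array.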

I *think* my next step based on the various posts I've read would be to 
try the same mdadm -A command with --force, but I'm a little wary of 
that and want to make sure I actually understand what I'm doing so I 
don't screw up the array entirely and lose all my data!  Should I be 
giving it *all* of the drives as args, including the spares, or should 
I just pass it the active drives?  Should I use the --raid-devices 
and/or --spare-devices options?  Anything else I should include or not 
include?
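
Before reaching for --force, a more cautious sequence might look like the sketch below: check whether md51 is sitting in /proc/mdstat as an inactive array (which may be holding the member partitions busy), stop it if so, and then retry the plain assemble with --verbose so that mdadm reports what it decides about each device. This is an assumption about what is worth checking, not a diagnosis. The helper names (run, md_is_inactive, try_assemble) are invented here, and the script only prints the commands unless RUN=1 is set. It passes all ten partitions, on the assumption that --assemble reads each member's role, including spare, from its superblock.

```shell
#!/bin/sh
# Sketch only: prints commands unless RUN=1 is set (running for real needs root).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# True if the named md device is listed as inactive in an mdstat-style file.
# $1 = kernel name (e.g. md51), $2 = file to read (default /proc/mdstat).
md_is_inactive() {
    grep -q "^$1 : inactive" "${2:-/proc/mdstat}"
}

# $1 = mdstat file (optional, mainly for testing).
# The device list is the one from the assemble command in this message.
try_assemble() {
    if md_is_inactive md51 "${1:-/proc/mdstat}"; then
        # An inactive, half-assembled array may be holding its members.
        run mdadm --stop /dev/md/51
    fi
    run mdadm --assemble --verbose /dev/md/51 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 \
        /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1
}

# Usage: try_assemble          (dry run, prints the commands)
#        RUN=1 try_assemble    (as root, actually runs them)
```

The verbose output from this non-forced attempt is also the kind of detail worth including in a follow-up to the list.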

Thanks in advance for any advice you can provide.  I won't be able to 
test until Monday morning but it would be great to be armed with things 
to try so I can hopefully get back up and running soon and minimize all 
of those "When will the network share be back up?" questions that I'm 
already anticipating getting.

Cheers,
-steve
