Re: RAID 5 array recovery - two drive errors in external enclosure

I re-cabled the drives so that they show up under the same device
names as they did before, when they were in the enclosure.

I then went ahead and tried your suggestion of re-creating the array.
I tried this first:

mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[bcde]1 missing

mount -o ro /dev/md0 /mnt/teradata

/var/log/messages:
-----------------
Sep 17 16:07:09 tera kernel: md: bind<sdb1>
Sep 17 16:07:09 tera kernel: md: bind<sdc1>
Sep 17 16:07:09 tera kernel: md: bind<sdd1>
Sep 17 16:07:09 tera kernel: md: bind<sde1>
Sep 17 16:07:09 tera kernel: raid5: device sde1 operational as raid disk 3
Sep 17 16:07:09 tera kernel: raid5: device sdd1 operational as raid disk 2
Sep 17 16:07:09 tera kernel: raid5: device sdc1 operational as raid disk 1
Sep 17 16:07:09 tera kernel: raid5: device sdb1 operational as raid disk 0
Sep 17 16:07:09 tera kernel: raid5: allocated 5268kB for md0
Sep 17 16:07:09 tera kernel: raid5: raid level 5 set md0 active with 4 out of 5 devices, algorithm 2
Sep 17 16:07:09 tera kernel: RAID5 conf printout:
Sep 17 16:07:09 tera kernel: --- rd:5 wd:4
Sep 17 16:07:09 tera kernel: disk 0, o:1, dev:sdb1
Sep 17 16:07:09 tera kernel: disk 1, o:1, dev:sdc1
Sep 17 16:07:09 tera kernel: disk 2, o:1, dev:sdd1
Sep 17 16:07:09 tera kernel: disk 3, o:1, dev:sde1
Sep 17 16:07:56 tera kernel: EXT3-fs error (device md0): ext3_check_descriptors: Block bitmap for group 8064 not in group (block 532677632)!
Sep 17 16:07:56 tera kernel: EXT3-fs: group descriptors corrupted!
--------------------------------
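
(Also - should I be running a read-only filesystem check against each
attempt rather than just mounting it, as you suggested?  Something
like this is what I had in mind:

   fsck.ext3 -n /dev/md0
)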


I then tried a few more permutations of the command:
mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[bdce]1 missing
mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[bdec]1 missing
mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[becd]1 missing

Every time I changed the order in the command, the log still showed
the devices assembled in the same order:

Sep 17 16:02:52 tera kernel: md: bind<sdb1>
Sep 17 16:02:52 tera kernel: md: bind<sdc1>
Sep 17 16:02:52 tera kernel: md: bind<sdd1>
Sep 17 16:02:52 tera kernel: md: bind<sde1>
Sep 17 16:02:52 tera kernel: raid5: device sde1 operational as raid disk 3
Sep 17 16:02:52 tera kernel: raid5: device sdd1 operational as raid disk 2
Sep 17 16:02:52 tera kernel: raid5: device sdc1 operational as raid disk 1
Sep 17 16:02:52 tera kernel: raid5: device sdb1 operational as raid disk 0
Sep 17 16:02:52 tera kernel: raid5: allocated 5268kB for md0
Sep 17 16:02:52 tera kernel: raid5: raid level 5 set md0 active with 4 out of 5 devices, algorithm 2
Sep 17 16:02:52 tera kernel: RAID5 conf printout:
Sep 17 16:02:52 tera kernel: --- rd:5 wd:4
Sep 17 16:02:52 tera kernel: disk 0, o:1, dev:sdb1
Sep 17 16:02:52 tera kernel: disk 1, o:1, dev:sdc1
Sep 17 16:02:52 tera kernel: disk 2, o:1, dev:sdd1
Sep 17 16:02:52 tera kernel: disk 3, o:1, dev:sde1
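
I'm starting to wonder whether the shell is just expanding the bracket
glob in sorted order no matter how I write it, so the order inside the
brackets may never even reach mdadm.  A quick echo seems to confirm it:

   echo /dev/sd[bdce]1
   /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1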



Am I doing something wrong?
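
Should I be listing the partitions explicitly instead of using the
glob, so that the order I want actually gets passed to mdadm?
Something like this, for example (just one of the orderings I'd try):

   mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sdb1 /dev/sdd1 /dev/sdc1 /dev/sde1 missing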




On Thu, Sep 17, 2009 at 2:22 PM, Robin Hill <robin@xxxxxxxxxxxxxxx> wrote:
> On Thu Sep 17, 2009 at 01:42:30PM -0700, Tim Bostrom wrote:
>
>> OK,
>>
>> Let me start off by saying - I panicked.  Rule #1 - don't panic.  I
>> did.  Sorry.
>>
>> I have a RAID 5 array running on Fedora 10.
>> (Linux tera.teambostrom.com 2.6.27.30-170.2.82.fc10.i686 #1 SMP Mon
>> Aug 17 08:38:59 EDT 2009 i686 athlon i386 GNU/Linux)
>>
>> 5 drives in an external enclosure (AMS eSATA Venus T5).  It's a
>> Sil4726 inside the enclosure running to a Sil3132 controller via eSATA
>> in the desktop.  I had been running this setup for just over a year.
>> Was working fine.   I just moved into a new home and had my server
>> down for a while  - before I brought it back online, I got a "great
>> idea" to blow out the dust from the enclosure using compressed air.
>> When I finally brought up the array again, I noticed that drives were
>> missing.  Tried re-adding the drives to the array and had some issues
>> - they seemed to get added but after a short time of rebuilding the
>> array, I would get a bunch of HW resets in dmesg and then the array
>> would kick out drives and stop.
>>
> <- much snippage ->
>
>> I popped the drives out of the enclosure and into the actual tower
>> case and connected each of them to its own SATA port.  The HW resets
>> seemed to go away, but I couldn't get the array to come back online.
>>  Then I did the stupid panic (following someone's advice I shouldn't
>> have).
>>
>> Thinking I should just re-create the array, I did:
>>
>> mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[b-f]1
>>
>> Stupid me again - I ignored the warning that it already belonged to
>> an array.  I let it build for a minute or so and then tried to mount it
>> while rebuilding... and got error messages:
>>
>> EXT3-fs: unable to read superblock
>> EXT3-fs: md0: couldn't mount because of unsupported optional features
>> (3fd18e00).
>>
>> Now - I'm at a loss.  I'm afraid to do anything else.   I've been
>> viewing the FAQ and I have a few ideas, but I'm just more freaked.  Is
>> there any hope?  What should I do next without causing more trouble?
>>
> Looking at the mdadm output, there are a couple of possible errors.
> Firstly, your newly created array has a different chunksize than your
> original one.  Secondly, the drives may be in the wrong order.  In
> either case, provided you don't _actually_ have any faulty drives, it
> should be (mostly) recoverable.
>
> Given the order you specified the drives in the create, sdf1 will be the
> partition that's been trashed by the rebuild, so you'll want to leave
> that out altogether for now.
>
> You need to try to recreate the array with the correct chunk size and
> with the remaining drives in different orders, running a read-only
> filesystem check each time until you find the correct order.
>
> So start with:
>    mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[bcde]1 missing
>
> Then repeat for every possible order of the four disks and "missing",
> stopping the array each time if the mount fails.
>
> When you've finally found the correct order, you can re-add sdf1 to get
> the array back to normal.
>
> HTH,
>    Robin
> --
>     ___
>    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
>   / / )      | Little Jim says ....                            |
>  // !!       |      "He fallen in de water !!"                 |
>



-- 
-tim