Re: RAID 5 array recovery - two drives errors in external enclosure

This seemed to work, though I'm still working through the permutations
of the drive letters.

I noticed that mdadm thinks that partition sde1 has an ext2 filesystem on
it.  See below:

[root@tera tbostrom]# mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sdb1
/dev/sdd1 /dev/sdc1 /dev/sde1 missing
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=5 ctime=Thu Sep 17 21:13:21 2009
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=5 ctime=Thu Sep 17 21:13:21 2009
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=5 ctime=Thu Sep 17 21:13:21 2009
mdadm: /dev/sde1 appears to contain an ext2fs file system
    size=396408836K  mtime=Mon Sep 21 03:41:16 2026
mdadm: /dev/sde1 appears to be part of a raid array:
    level=raid5 devices=5 ctime=Thu Sep 17 21:13:21 2009
Continue creating array?

What gives?  I tried popping sdf1 in there without actually creating the
array (I answered "no" at the prompt) - just to see what would happen -
and it thinks that sdf1 has ext2 as well.

[root@tera tbostrom]# mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sdc1
/dev/sdb1 /dev/sde1 /dev/sdf1 missing
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=5 ctime=Thu Sep 17 21:13:21 2009
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=5 ctime=Thu Sep 17 21:13:21 2009
mdadm: /dev/sde1 appears to contain an ext2fs file system
    size=396408836K  mtime=Mon Sep 21 03:41:16 2026
mdadm: /dev/sde1 appears to be part of a raid array:
    level=raid5 devices=5 ctime=Thu Sep 17 21:13:21 2009
mdadm: /dev/sdf1 appears to contain an ext2fs file system
    size=-387928064K  mtime=Wed Sep 16 16:26:42 2009
mdadm: /dev/sdf1 appears to be part of a raid array:
    level=raid5 devices=5 ctime=Thu Sep 17 20:54:33 2009
Continue creating array? no
mdadm: create aborted.
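
(My guess - and it's only a guess - is that those "appears to contain an
ext2fs file system" lines just mean mdadm found a leftover filesystem
signature when it scanned the partitions before creating, probably debris
from my earlier botched create/rebuild, not that sde1/sdf1 are actually
formatted ext2.  The mtime in 2026 on sde1 and the negative size on sdf1
look like garbage being misread, which would fit.  Before going further
I'll probably check what's really sitting on them with something
read-only, e.g.:

[root@tera tbostrom]# blkid /dev/sde1 /dev/sdf1
[root@tera tbostrom]# file -s /dev/sde1 /dev/sdf1

Both of those only read from the devices, so they shouldn't disturb
anything.)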


Still at a loss here.  I haven't worked through all the drive
permutations yet, so I'll keep going through those in the meantime.  Does
it make sense to include sdf1 in the permutations, since the drive letters
may have changed after moving the drives out of the enclosure?  I thought
I put them back in the same order they were in the enclosure.
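
In case it helps anyone following along, here's the rough loop I'm
planning to use to grind through the orderings Robin described.  It's
only a sketch, and it leans on a few assumptions: that a degraded create
(with one slot "missing") doesn't start a resync, that 0.90 metadata and
256K chunks match the original array (per the --examine output), and that
--run is acceptable here to skip the confirmation prompt.  Each create
does rewrite the member superblocks, but that's already happened with the
manual attempts above.

-----------------
#!/bin/bash
# Sketch: try every ordering of the four partitions plus "missing",
# build the degraded array, run a read-only fsck, then stop it again.
set -u

MEMBERS=(/dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 missing)

try_order() {
    echo "== trying: $*"
    mdadm -S /dev/md0 2>/dev/null          # stop any previous attempt
    # --run skips the "Continue creating array?" prompt; with one member
    # missing the array comes up degraded, so no resync is kicked off.
    mdadm -C /dev/md0 -e 0.90 -l 5 -n 5 -c 256 --run "$@" || return
    if fsck.ext3 -n /dev/md0; then         # -n = check only, change nothing
        echo "== filesystem looks sane with order: $*"
        exit 0
    fi
    mdadm -S /dev/md0
}

permute() {
    # $1 = order built so far; remaining args = members not yet placed.
    local prefix=$1; shift
    if [ $# -eq 0 ]; then
        try_order $prefix                  # unquoted on purpose: split into words
        return
    fi
    local i
    for ((i = 1; i <= $#; i++)); do
        permute "$prefix ${!i}" "${@:1:i-1}" "${@:i+1}"
    done
}

permute "" "${MEMBERS[@]}"
echo "no ordering produced a clean fsck"
-----------------

If the right order turns up, the plan would be to stop there, mount
read-only to double-check, and only then think about re-adding sdf1.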

-Tim


On Thu, Sep 17, 2009 at 6:31 PM, Guy Watkins
<linux-raid@xxxxxxxxxxxxxxxx> wrote:
> It is the way you list the drives.  Look at this command:
> # echo /dev/sd[bdce]1
> /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
>
> Notice the output is not in the same order as in the command.  You should
> list each disk in the order you want.  Like this:
> mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
> missing
>
> I hope this helps.
>
> } -----Original Message-----
> } From: linux-raid-owner@xxxxxxxxxxxxxxx [mailto:linux-raid-
> } owner@xxxxxxxxxxxxxxx] On Behalf Of Tim Bostrom
> } Sent: Thursday, September 17, 2009 7:55 PM
> } To: linux-raid
> } Subject: Re: RAID 5 array recovery - two drives errors in external
> } enclosure
> }
> } It's still showing the order that you had previously posted:  [bcde]
> } (see log below)
> }
> } It appears that trying different permutations isn't yielding any
> } change.  I haven't tried every permutation, but are these commands
> } supposed to yield different effects?  They seem to always build the
> } array as [bcde] no matter what.  Or should I be swapping around the
> } cables on the drives?
> }
> } >> mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[bdce]1 missing
> } >> mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[bdec]1 missing
> } >> mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[becd]1 missing
> }
> }
> } -Tim
> }
> } [root@tera ~]# mdadm --examine /dev/sdb1
> } /dev/sdb1:
> }           Magic : a92b4efc
> }         Version : 0.90.00
> }            UUID : 9fefb6ce:dcbfe649:f456b3f0:371e8bcc
> }   Creation Time : Thu Sep 17 16:13:45 2009
> }      Raid Level : raid5
> }   Used Dev Size : 976759808 (931.51 GiB 1000.20 GB)
> }      Array Size : 3907039232 (3726.04 GiB 4000.81 GB)
> }    Raid Devices : 5
> }   Total Devices : 5
> } Preferred Minor : 0
> }
> }     Update Time : Thu Sep 17 16:13:45 2009
> }           State : clean
> }  Active Devices : 4
> } Working Devices : 4
> }  Failed Devices : 1
> }   Spare Devices : 0
> }        Checksum : 20f1deab - correct
> }          Events : 1
> }
> }          Layout : left-symmetric
> }      Chunk Size : 256K
> }
> }       Number   Major   Minor   RaidDevice State
> } this     0       8       17        0      active sync   /dev/sdb1
> }
> }    0     0       8       17        0      active sync   /dev/sdb1
> }    1     1       8       33        1      active sync   /dev/sdc1
> }    2     2       8       49        2      active sync   /dev/sdd1
> }    3     3       8       65        3      active sync   /dev/sde1
> }    4     4       0        0        4      faulty
> }
> }
> }
> } On Thu, Sep 17, 2009 at 4:28 PM, Majed B. <majedb@xxxxxxxxx> wrote:
> } > Before creating the array, did you re-examine the disks with mdadm and
> } > make sure of each disk's position in the array?
> } >
> } > After your recabling, the disk names may have changed again.
> } >
> } > mdadm --examine /dev/sdb1
> } >
> } >      Number   Major   Minor   RaidDevice State
> } > this     7       8       17        7      active sync   /dev/sdb1
> } >
> } >   0     0       8      113        0      active sync   /dev/sdh1
> } >   1     1       8       97        1      active sync   /dev/sdg1
> } >   2     2       0        0        2      faulty removed
> } >   3     3       0        0        3      faulty removed
> } >   4     4       8       33        4      active sync   /dev/sdc1
> } >   5     5       8       65        5      active sync   /dev/sde1
> } >   6     6       8       49        6      active sync   /dev/sdd1
> } >   7     7       8       17        7      active sync   /dev/sdb1
> } >
> } > (That's the output of an array I'm working on)
> } >
> } > Notice the first line: *this* and then the value of RaidDevice. That's
> } > the position of the partition in the array. 0 is first, 1 is second,
> } > and so on.
> } >
> } > In my case, the order is: sdh1,sdg1,missing,missing,sdc1,sde1,sdd1,sdb1
> } >
> } > On Fri, Sep 18, 2009 at 2:11 AM, Tim Bostrom <tbostrom@xxxxxxxxx> wrote:
> } >> I re-cabled the drives so that they show up with the same drive letters
> } >> as they had before, when they were in the enclosure.
> } >>
> } >> I then went ahead and tried your idea of restarting the array. I tried
> } >> this first:
> } >>
> } >> mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[bcde]1 missing
> } >>
> } >> mount -o ro /dev/md0 /mnt/teradata
> } >>
> } >> /var/log/messages:
> } >> -----------------
> } >> Sep 17 16:07:09 tera kernel: md: bind<sdb1>
> } >> Sep 17 16:07:09 tera kernel: md: bind<sdc1>
> } >> Sep 17 16:07:09 tera kernel: md: bind<sdd1>
> } >> Sep 17 16:07:09 tera kernel: md: bind<sde1>
> } >> Sep 17 16:07:09 tera kernel: raid5: device sde1 operational as raid disk 3
> } >> Sep 17 16:07:09 tera kernel: raid5: device sdd1 operational as raid disk 2
> } >> Sep 17 16:07:09 tera kernel: raid5: device sdc1 operational as raid disk 1
> } >> Sep 17 16:07:09 tera kernel: raid5: device sdb1 operational as raid disk 0
> } >> Sep 17 16:07:09 tera kernel: raid5: allocated 5268kB for md0
> } >> Sep 17 16:07:09 tera kernel: raid5: raid level 5 set md0 active with 4
> } >> out of 5 devices, algorithm 2
> } >> Sep 17 16:07:09 tera kernel: RAID5 conf printout:
> } >> Sep 17 16:07:09 tera kernel: --- rd:5 wd:4
> } >> Sep 17 16:07:09 tera kernel: disk 0, o:1, dev:sdb1
> } >> Sep 17 16:07:09 tera kernel: disk 1, o:1, dev:sdc1
> } >> Sep 17 16:07:09 tera kernel: disk 2, o:1, dev:sdd1
> } >> Sep 17 16:07:09 tera kernel: disk 3, o:1, dev:sde1
> } >> Sep 17 16:07:56 tera kernel: EXT3-fs error (device md0):
> } >> ext3_check_descriptors: Block bitmap for group 8064 not in group
> } >> (block 532677632)!
> } >> Sep 17 16:07:56 tera kernel: EXT3-fs: group descriptors corrupted!
> } >> --------------------------------
> } >>
> } >>
> } >> I then tried a few more permutations of the command:
> } >> mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[bdce]1 missing
> } >> mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[bdec]1 missing
> } >> mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[becd]1 missing
> } >>
> } >> Every time I changed the order, it would still print the order the
> } >> same in the log:
> } >>
> } >> Sep 17 16:02:52 tera kernel: md: bind<sdb1>
> } >> Sep 17 16:02:52 tera kernel: md: bind<sdc1>
> } >> Sep 17 16:02:52 tera kernel: md: bind<sdd1>
> } >> Sep 17 16:02:52 tera kernel: md: bind<sde1>
> } >> Sep 17 16:02:52 tera kernel: raid5: device sde1 operational as raid disk 3
> } >> Sep 17 16:02:52 tera kernel: raid5: device sdd1 operational as raid disk 2
> } >> Sep 17 16:02:52 tera kernel: raid5: device sdc1 operational as raid disk 1
> } >> Sep 17 16:02:52 tera kernel: raid5: device sdb1 operational as raid disk 0
> } >> Sep 17 16:02:52 tera kernel: raid5: allocated 5268kB for md0
> } >> Sep 17 16:02:52 tera kernel: raid5: raid level 5 set md0 active with 4
> } >> out of 5 devices, algorithm 2
> } >> Sep 17 16:02:52 tera kernel: RAID5 conf printout:
> } >> Sep 17 16:02:52 tera kernel: --- rd:5 wd:4
> } >> Sep 17 16:02:52 tera kernel: disk 0, o:1, dev:sdb1
> } >> Sep 17 16:02:52 tera kernel: disk 1, o:1, dev:sdc1
> } >> Sep 17 16:02:52 tera kernel: disk 2, o:1, dev:sdd1
> } >> Sep 17 16:02:52 tera kernel: disk 3, o:1, dev:sde1
> } >>
> } >>
> } >>
> } >> Am I doing something wrong?
> } >>
> } >>
> } >>
> } >>
> } >> On Thu, Sep 17, 2009 at 2:22 PM, Robin Hill <robin@xxxxxxxxxxxxxxx> wrote:
> } >>> On Thu Sep 17, 2009 at 01:42:30PM -0700, Tim Bostrom wrote:
> } >>>
> } >>>> OK,
> } >>>>
> } >>>> Let me start off by saying - I panicked.  Rule #1 - don't panic.  I
> } >>>> did.  Sorry.
> } >>>>
> } >>>> I have a RAID 5 array running on Fedora 10.
> } >>>> (Linux tera.teambostrom.com 2.6.27.30-170.2.82.fc10.i686 #1 SMP Mon
> } >>>> Aug 17 08:38:59 EDT 2009 i686 athlon i386 GNU/Linux)
> } >>>>
> } >>>> 5 drives in an external enclosure (AMS eSATA Venus T5).  It's a
> } >>>> Sil4726 inside the enclosure running to a Sil3132 controller via eSATA
> } >>>> in the desktop.  I had been running this setup for just over a year.
> } >>>> Was working fine.   I just moved into a new home and had my server
> } >>>> down for a while  - before I brought it back online, I got a "great
> } >>>> idea" to blow out the dust from the enclosure using compressed air.
> } >>>> When I finally brought up the array again, I noticed that drives were
> } >>>> missing.  Tried re-adding the drives to the array and had some issues
> } >>>> - they seemed to get added but after a short time of rebuilding the
> } >>>> array, I would get a bunch of HW resets in dmesg and then the array
> } >>>> would kick out drives and stop.
> } >>>>
> } >>> <- much snippage ->
> } >>>
> } >>>> I popped the drives out of the enclosure and into the actual tower
> } >>>> case and connected each of them to its own SATA port.  The HW resets
> } >>>> seemed to go away, but I couldn't get the array to come back online.
> } >>>>  Then I did the stupid panic (following someone's advice I shouldn't
> } >>>> have).
> } >>>>
> } >>>> Thinking I should just re-create the array, I did:
> } >>>>
> } >>>> mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[b-f]1
> } >>>>
> } >>>> Stupid me again - I ignored the warning that it belongs to an array
> } >>>> already.  I let it build for a minute or so and then tried to mount it
> } >>>> while rebuilding... and got error messages:
> } >>>>
> } >>>> EXT3-fs: unable to read superblock
> } >>>> EXT3-fs: md0: couldn't mount because of unsupported optional features
> } >>>> (3fd18e00).
> } >>>>
> } >>>> Now - I'm at a loss.  I'm afraid to do anything else.   I've been
> } >>>> viewing the FAQ and I have a few ideas, but I'm just more freaked.  Is
> } >>>> there any hope?  What should I do next without causing more trouble?
> } >>>>
> } >>> Looking at the mdadm output, there are a couple of possible errors.
> } >>> Firstly, your newly created array has a different chunk size than your
> } >>> original one.  Secondly, the drives may be in the wrong order.  In
> } >>> either case, provided you don't _actually_ have any faulty drives, then
> } >>> it should be (mostly) recoverable.
> } >>>
> } >>> Given the order you specified the drives in the create, sdf1 will be the
> } >>> partition that's been trashed by the rebuild, so you'll want to leave
> } >>> that out altogether for now.
> } >>>
> } >>> You need to try to recreate the array with the correct chunk size and
> } >>> with the remaining drives in different orders, running a read-only
> } >>> filesystem check each time until you find the correct order.
> } >>>
> } >>> So start with:
> } >>>    mdadm -C /dev/md0 -l 5 -n 5 -c 256 /dev/sd[bcde]1 missing
> } >>>
> } >>> Then repeat for every possible order of the four disks and "missing",
> } >>> stopping the array each time if the mount fails.
> } >>>
> } >>> When you've finally found the correct order, you can re-add sdf1 to get
> } >>> the array back to normal.
> } >>>
> } >>> HTH,
> } >>>    Robin
> } >>> --
> } >>>     ___
> } >>>    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
> } >>>   / / )      | Little Jim says ....                            |
> } >>>  // !!       |      "He fallen in de water !!"                 |
> } >>>
> } >>
> } >>
> } >>
> } >> --
> } >> -tim
> } >>
> } >
> } >
> } >
> } > --
> } >       Majed B.
> } >
> }
> }
> }
> } --
> } -tim
>
>



-- 
-tim
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
