Failed Array Rebuild - advice please

For various reasons, the email notifications on my RAID6 array weren't working, and 2 of the 15 drives had failed out. I noticed this last week as I was about to move the server into a new case. As part of the move I upgraded the OS to the latest CentOS, as I was having issues with the existing install and the new HBA card (a SASLP-MV8).
 
When the server came back up, for some reason it decided to fire up the md array with only 1 drive, and that incremented the Event count on that drive (and since the RAID6 is already running with 2 failed drives, I can't just kick that drive out and let it rebuild).
 
The array shows this...
 
 mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sat Jun  5 10:38:11 2010
     Raid Level : raid6
  Used Dev Size : 488383488 (465.76 GiB 500.10 GB)
   Raid Devices : 15
  Total Devices : 12
    Persistence : Superblock is persistent
    Update Time : Mon Apr  9 13:05:31 2012
          State : active, FAILED, Not Started
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 512K
           Name : file00bert.woodlea.org.uk:0  (local to host file00bert.woodlea.org.uk)
           UUID : 1470c671:4236b155:67287625:899db153
         Events : 1378022
    Number   Major   Minor   RaidDevice State
       0       8      113        0      active sync   /dev/sdh1
       1       8      209        1      active sync   /dev/sdn1
       2       8      225        2      active sync   /dev/sdo1
      15       8       17        3      active sync   /dev/sdb1
       4       8      145        4      active sync   /dev/sdj1
       5       8      161        5      active sync   /dev/sdk1
       6       0        0        6      removed
       7       8       81        7      active sync   /dev/sdf1
       8       8       97        8      active sync   /dev/sdg1
      16       8       65        9      active sync   /dev/sde1
      10       8       33       10      active sync   /dev/sdc1
      11       0        0       11      removed
      12       8      177       12      active sync   /dev/sdl1
      13       8      241       13      active sync   /dev/sdp1
      14       0        0       14      removed

 
Looking at the Event counts on the drives as they currently stand, they show this:
 
sda1 1378024
sdb1 1378022
sdc1 1378022
sdd1 1362956
sde1 1378022
sdf1 1378022
sdg1 1378022
sdh1 1378022
sdj1 1378022
sdk1 1378022
sdl1 1378022
sdm1  616796
sdn1 1378022
sdo1 1378022
sdp1 1378022
 
So, /dev/sdd1 and /dev/sdm1 are the 2 failed drives. The Event counts on all the other drives agree with each other, and with that of the array, except for /dev/sda1, which is a couple of events higher than everything else - and because of that I can't start the array.
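 
(For reference, I pulled those Event counts from the member superblocks with something along these lines - sdi is skipped because it isn't part of the array:)
 
# print the Events value from each member's superblock
for d in /dev/sd[a-h]1 /dev/sd[j-p]1; do
    echo -n "$d  "
    mdadm --examine "$d" | awk '/Events/ {print $3}'
done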
 
 
Since I know I did nothing with the temporary one-drive array when the server was booted (and I don't think the md code did anything with it either?), would it be safe to run
 
mdadm --assemble --force /dev/md0 /dev/sd[a-c]1 /dev/sd[e-h]1 /dev/sd[j-l]1 /dev/sd[n-p]1
 
to let the array come back up and get it running?
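 
I assume I'd have to stop the stale single-drive assembly first before trying that, something like:
 
# stop whatever md0 was partially assembled at boot (my assumption)
mdadm --stop /dev/md0
 
# confirm nothing is left running before attempting the forced assemble
cat /proc/mdstat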
 
What would then be the correct sequence to replace the 2 failed drives (sdd1 and sdm1) and get the array running fully again?
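 
My guess at the sequence is something like the following, once the replacement drives (same 500GB size) are in - the device names and the use of sfdisk are assumptions on my part, so please correct me if this is wrong:
 
# copy the partition layout from a surviving member onto each replacement disk
sfdisk -d /dev/sdb | sfdisk /dev/sdd
sfdisk -d /dev/sdb | sfdisk /dev/sdm
 
# make sure there is no stale md superblock on the new partitions
mdadm --zero-superblock /dev/sdd1
mdadm --zero-superblock /dev/sdm1
 
# add both back and let the RAID6 rebuild them
mdadm --manage /dev/md0 --add /dev/sdd1 /dev/sdm1
 
# watch the resync progress
cat /proc/mdstat
 
Does that look right, or is there a better order to do it in?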
 
 
Thanks for your help.
 
 
YP.
--

