Failed RAID 6 array advice

I've just had a 3rd drive fail on one of my RAID 6 arrays, and I'm looking for 
some advice on how to get it back up far enough that I can recover the data, and 
then replace the other failed drives.


mdadm -V
mdadm - v3.0.3 - 22nd October 2009


Not the most up-to-date release, but it seems to be the latest one available on 
FC12.



The /etc/mdadm.conf file contains only:

ARRAY /dev/md0 uuid=1470c671:4236b155:67287625:899db153


That explains why I didn't get emailed about the drive failures: there is no 
MAILADDR line for mdadm --monitor to send alerts to. This isn't my standard file, 
and I don't know how it was changed, but that's another issue for another day.
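
For reference, my normal mdadm.conf has a MAILADDR line so that mdadm --monitor 
(the mdmonitor service on Fedora) knows where to send alerts. Something along 
these lines, with the address just a placeholder:

MAILADDR root
ARRAY /dev/md0 UUID=1470c671:4236b155:67287625:899db153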



mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sat Jun  5 10:38:11 2010
     Raid Level : raid6
  Used Dev Size : 488383488 (465.76 GiB 500.10 GB)
   Raid Devices : 15
  Total Devices : 12
    Persistence : Superblock is persistent
    Update Time : Tue Mar  1 22:17:41 2011
          State : active, degraded, Not Started
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0
     Chunk Size : 512K
           Name : file00bert.woodlea.org.uk:0  (local to host file00bert.woodlea.org.uk)
           UUID : 1470c671:4236b155:67287625:899db153
         Events : 254890
    Number   Major   Minor   RaidDevice State
       0       8      113        0      active sync   /dev/sdh1
       1       8       17        1      active sync   /dev/sdb1
       2       8      177        2      active sync   /dev/sdl1
       3       0        0        3      removed
       4       8       33        4      active sync   /dev/sdc1
       5       8      193        5      active sync   /dev/sdm1
       6       0        0        6      removed
       7       8       49        7      active sync   /dev/sdd1
       8       8      209        8      active sync   /dev/sdn1
       9       8      161        9      active sync   /dev/sdk1
      10       0        0       10      removed
      11       8      225       11      active sync   /dev/sdo1
      12       8       81       12      active sync   /dev/sdf1
      13       8      241       13      active sync   /dev/sdp1
      14       8        1       14      active sync   /dev/sda1



The output from the failed drives is as follows.


mdadm --examine /dev/sde1
/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 1470c671:4236b155:67287625:899db153
           Name : file00bert.woodlea.org.uk:0  (local to host file00bert.woodlea.org.uk)
  Creation Time : Sat Jun  5 10:38:11 2010
     Raid Level : raid6
   Raid Devices : 15
 Avail Dev Size : 976767730 (465.76 GiB 500.11 GB)
     Array Size : 12697970688 (6054.86 GiB 6501.36 GB)
  Used Dev Size : 976766976 (465.76 GiB 500.10 GB)
    Data Offset : 272 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 3e284f2e:d939fb97:0b74eb88:326e879c
Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Mar  1 21:53:31 2011
       Checksum : 768f0f34 - correct
         Events : 254591
     Chunk Size : 512K
   Device Role : Active device 10
   Array State : AAA.AA.AAAAAAAA ('A' == active, '.' == missing)


The above is the drive that failed tonight, and the one I would like to re-add 
to the array. There have been no writes to the filesystem on the array in the 
last couple of days (other than what ext4 would do on its own).


 mdadm --examine /dev/sdi1
/dev/sdi1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 1470c671:4236b155:67287625:899db153
           Name : file00bert.woodlea.org.uk:0  (local to host file00bert.woodlea.org.uk)
  Creation Time : Sat Jun  5 10:38:11 2010
     Raid Level : raid6
   Raid Devices : 15
 Avail Dev Size : 976767730 (465.76 GiB 500.11 GB)
     Array Size : 12697970688 (6054.86 GiB 6501.36 GB)
  Used Dev Size : 976766976 (465.76 GiB 500.10 GB)
    Data Offset : 272 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 8e668e39:06d8281b:b79aa3ab:a1d55fb5
Internal Bitmap : 2 sectors from superblock
    Update Time : Thu Feb 10 18:20:54 2011
       Checksum : 4078396b - correct
         Events : 254075
     Chunk Size : 512K
   Device Role : Active device 3
   Array State : AAAAAA.AAAAAAAA ('A' == active, '.' == missing)


mdadm --examine /dev/sdj1
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 1470c671:4236b155:67287625:899db153
           Name : file00bert.woodlea.org.uk:0  (local to host file00bert.woodlea.org.uk)
  Creation Time : Sat Jun  5 10:38:11 2010
     Raid Level : raid6
   Raid Devices : 15
 Avail Dev Size : 976767730 (465.76 GiB 500.11 GB)
     Array Size : 12697970688 (6054.86 GiB 6501.36 GB)
  Used Dev Size : 976766976 (465.76 GiB 500.10 GB)
    Data Offset : 272 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 37d422cc:8436960a:c3c4d11c:81a8e4fa
Internal Bitmap : 2 sectors from superblock
    Update Time : Thu Oct 21 23:45:06 2010
       Checksum : 78950bb5 - correct
         Events : 21435
     Chunk Size : 512K
   Device Role : Active device 6
   Array State : AAAAAAAAAAAAAAA ('A' == active, '.' == missing)


Looks like sdj1 failed waaay back in October last year (sigh). As I said, I am not 
too bothered about adding these last 2 drives back into the array, since they 
failed so long ago. I have a couple of spare drives sitting here, and I will 
replace those 2 drives with them (once I have run badblocks on them). 
Looking at the output of dmesg, there are no other errors showing for the 3 
drives, other than them being kicked out of the array for being non-fresh.

I guess I have a couple of questions.

What's the correct process for adding the failed /dev/sde1 back into the array 
so I can start it? I don't want to rush into this and make things worse.
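
My best guess from the man page is to stop the array and force an assemble that 
includes /dev/sde1, since its event count (254591) is only a little behind the 
rest (254890). Something like the following (device list read off the --detail 
output above, so treat it as a sketch rather than exact commands):

mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[abcdefhklmnop]1

Is that right, or is there a safer route (--re-add with the internal bitmap, 
maybe)?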

What's the correct process for replacing the 2 other drives?
I am presuming that I need to --fail, then --remove, then --add the drives (one 
at a time?), but I want to make sure.
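
In other words, something like this for each old slot, one drive at a time and 
waiting for the rebuild to complete in between (/dev/sdq1 is just a made-up name 
for the replacement disk):

mdadm /dev/md0 --fail /dev/sdi1
mdadm /dev/md0 --remove /dev/sdi1
mdadm /dev/md0 --add /dev/sdq1

Although, since sdi1 and sdj1 already show as removed in the --detail output, 
perhaps the --fail/--remove steps aren't even needed?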


Thanks for your help.


Graham.


      