Re: Failed RAID 6 array advice

On Tue, 1 Mar 2011 21:05:33 -0800 (PST) jahammonds prost <gmitch64@xxxxxxxxx>
wrote:

> I've just had a 3rd drive fail on one of my RAID 6 arrays, and I'm looking for
> some advice on how to get it back enough that I can recover the data, and then
> replace the other failed drives.
> 
> 
> mdadm -V
> mdadm - v3.0.3 - 22nd October 2009
> 
> 
> Not the most up-to-date release, but it seems to be the latest one available on
> FC12.
> 
> 
> 
> The /etc/mdadm.conf file is
> 
> ARRAY /dev/md0 uuid=1470c671:4236b155:67287625:899db153
> 
> 
> Which explains why I didn't get emailed about the drive failures. This isn't my 
> standard file, and I don't know how it was changed, but that's another issue for 
> another day.
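
Side note: mdadm --monitor only mails failure reports when a recipient is
configured, so a conf file with nothing but the ARRAY line stays silent. A
minimal sketch of the relevant lines, assuming the mdmonitor service is running
and that mail to root is delivered locally:

  MAILADDR root
  ARRAY /dev/md0 uuid=1470c671:4236b155:67287625:899db153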
> 
> 
> 
> mdadm --detail /dev/md0
> /dev/md0:
>         Version : 1.2
>   Creation Time : Sat Jun  5 10:38:11 2010
>      Raid Level : raid6
>   Used Dev Size : 488383488 (465.76 GiB 500.10 GB)
>    Raid Devices : 15
>   Total Devices : 12
>     Persistence : Superblock is persistent
>     Update Time : Tue Mar  1 22:17:41 2011
>           State : active, degraded, Not Started
>  Active Devices : 12
> Working Devices : 12
>  Failed Devices : 0
>   Spare Devices : 0
>      Chunk Size : 512K
>            Name : file00bert.woodlea.org.uk:0  (local to host 
> file00bert.woodlea.org.uk)
>            UUID : 1470c671:4236b155:67287625:899db153
>          Events : 254890
>     Number   Major   Minor   RaidDevice State
>        0       8      113        0      active sync   /dev/sdh1
>        1       8       17        1      active sync   /dev/sdb1
>        2       8      177        2      active sync   /dev/sdl1
>        3       0        0        3      removed
>        4       8       33        4      active sync   /dev/sdc1
>        5       8      193        5      active sync   /dev/sdm1
>        6       0        0        6      removed
>        7       8       49        7      active sync   /dev/sdd1
>        8       8      209        8      active sync   /dev/sdn1
>        9       8      161        9      active sync   /dev/sdk1
>       10       0        0       10      removed
>       11       8      225       11      active sync   /dev/sdo1
>       12       8       81       12      active sync   /dev/sdf1
>       13       8      241       13      active sync   /dev/sdp1
>       14       8        1       14      active sync   /dev/sda1
> 
> 
> 
> The output from the failed drives are as follows.
> 
> 
> mdadm --examine /dev/sde1
> /dev/sde1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x1
>      Array UUID : 1470c671:4236b155:67287625:899db153
>            Name : file00bert.woodlea.org.uk:0  (local to host 
> file00bert.woodlea.org.uk)
>   Creation Time : Sat Jun  5 10:38:11 2010
>      Raid Level : raid6
>    Raid Devices : 15
>  Avail Dev Size : 976767730 (465.76 GiB 500.11 GB)
>      Array Size : 12697970688 (6054.86 GiB 6501.36 GB)
>   Used Dev Size : 976766976 (465.76 GiB 500.10 GB)
>     Data Offset : 272 sectors
>    Super Offset : 8 sectors
>           State : clean
>     Device UUID : 3e284f2e:d939fb97:0b74eb88:326e879c
> Internal Bitmap : 2 sectors from superblock
>     Update Time : Tue Mar  1 21:53:31 2011
>        Checksum : 768f0f34 - correct
>          Events : 254591
>      Chunk Size : 512K
>    Device Role : Active device 10
>    Array State : AAA.AA.AAAAAAAA ('A' == active, '.' == missing)
> 
> 
> The above is the drive that failed tonight, and the one I would like to re-add
> back into the array. There have been no writes to the filesystem on the array in
> the last couple of days (other than what ext4 would do on its own).
> 
> 
>  mdadm --examine /dev/sdi1
> /dev/sdi1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x1
>      Array UUID : 1470c671:4236b155:67287625:899db153
>            Name : file00bert.woodlea.org.uk:0  (local to host 
> file00bert.woodlea.org.uk)
>   Creation Time : Sat Jun  5 10:38:11 2010
>      Raid Level : raid6
>    Raid Devices : 15
>  Avail Dev Size : 976767730 (465.76 GiB 500.11 GB)
>      Array Size : 12697970688 (6054.86 GiB 6501.36 GB)
>   Used Dev Size : 976766976 (465.76 GiB 500.10 GB)
>     Data Offset : 272 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : 8e668e39:06d8281b:b79aa3ab:a1d55fb5
> Internal Bitmap : 2 sectors from superblock
>     Update Time : Thu Feb 10 18:20:54 2011
>        Checksum : 4078396b - correct
>          Events : 254075
>      Chunk Size : 512K
>    Device Role : Active device 3
>    Array State : AAAAAA.AAAAAAAA ('A' == active, '.' == missing)
> 
> 
> mdadm --examine /dev/sdj1
> /dev/sdj1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x1
>      Array UUID : 1470c671:4236b155:67287625:899db153
>            Name : file00bert.woodlea.org.uk:0  (local to host 
> file00bert.woodlea.org.uk)
>   Creation Time : Sat Jun  5 10:38:11 2010
>      Raid Level : raid6
>    Raid Devices : 15
>  Avail Dev Size : 976767730 (465.76 GiB 500.11 GB)
>      Array Size : 12697970688 (6054.86 GiB 6501.36 GB)
>   Used Dev Size : 976766976 (465.76 GiB 500.10 GB)
>     Data Offset : 272 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : 37d422cc:8436960a:c3c4d11c:81a8e4fa
> Internal Bitmap : 2 sectors from superblock
>     Update Time : Thu Oct 21 23:45:06 2010
>        Checksum : 78950bb5 - correct
>          Events : 21435
>      Chunk Size : 512K
>    Device Role : Active device 6
>    Array State : AAAAAAAAAAAAAAA ('A' == active, '.' == missing)
> 
> 
> Looks like sdj1 failed waaay back in Oct last year (sigh). As I said, I am not
> too bothered about adding these last 2 drives back into the array, since they
> failed so long ago. I have a couple of spare drives sitting here, and I will
> replace these 2 drives with them (once I have completed a badblocks run on them).
> Looking at the output of dmesg, there are no other errors showing for the 3
> drives, other than them being kicked out of the array for being non-fresh.
> 
> I guess I have a couple of questions.
> 
> What's the correct process for adding the failed /dev/sde1 back into the array
> so I can start it? I don't want to rush into this and make things worse.

If you think that the drives really are working and that it was a cabling
problem, then stop the array (if it isn't stopped already) and assemble with
--force:

 mdadm --assemble --force /dev/md0 /dev....list of devices
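
Going by the --detail and --examine output above, the member list would be the
12 drives still marked active plus /dev/sde1, whose event count (254591 vs the
array's 254890) is close enough that --force should accept it. Device names can
move around across reboots, so re-check them before running anything. A sketch:

  # stop any half-assembled remains of the array first
  mdadm --stop /dev/md0

  # forced assemble with the 12 current members plus /dev/sde1;
  # the glob expands to the active devices listed in --detail above
  mdadm --assemble --force /dev/md0 /dev/sde1 /dev/sd[abcdfhklmnop]1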

Then find the devices that it chose not to include and add them individually:
  mdadm /dev/md0 --add /dev/something
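
For example (hypothetical in that it assumes /dev/sde1 was left out; check
/proc/mdstat and the --detail output first to see what is actually missing):

  cat /proc/mdstat                 # which slots came back, which are still empty
  mdadm --detail /dev/md0

  mdadm /dev/md0 --add /dev/sde1   # only if --force did not already pull it in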

However, if any device has a bad block that cannot be read, then this won't
work.
In that case you need to get a new device, partition it to have a partition
EXACTLY the same size, use
  dd_rescue
to copy all the good data from the bad drive to the new drive, remove the bad
drive from the system, and run the "--assemble --force" command with the new
drive, not the old drive.
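
A rough sketch of that path, assuming the bad drive is /dev/sde, the new drive
appears as /dev/sdq (hypothetical name) and is at least as large as the old
one; dd_rescue option spellings vary a little between versions, so check its
help output first:

  # clone the partition table so the new partition is exactly the same size
  sfdisk -d /dev/sde | sfdisk /dev/sdq

  # copy everything readable from the failing partition onto the new one
  dd_rescue /dev/sde1 /dev/sdq1

  # pull the old drive out of the machine (names may shift, so re-check),
  # then assemble using the copy in its place
  mdadm --assemble --force /dev/md0 /dev/sdq1 /dev/sd[abcdfhklmnop]1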


> 
> What's the correct process for replacing the 2 other drives?
> I am presuming that I need to --fail, then --remove, then --add the drives (one
> at a time?), but I want to make sure.

They are already failed and removed, so there is no point in trying to do
that again.
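
Once the array is up again, the replacements just get partitioned and added,
and md rebuilds the two missing slots onto them. A sketch, with /dev/sdq and
/dev/sdr standing in for whatever names the new drives actually get:

  badblocks -wsv /dev/sdq          # destructive write test (wipes the drive)
  badblocks -wsv /dev/sdr

  # partition each at least as large as the existing members, then:
  mdadm /dev/md0 --add /dev/sdq1
  mdadm /dev/md0 --add /dev/sdr1

  cat /proc/mdstat                 # watch the recovery progress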

Good luck.

NeilBrown


> 
> 
> Thanks for your help.
> 
> 
> Graham.
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

