Re: How to identify a failed md array

On Mon, 26 May 2014 20:07:11 +0200 Sebastian Herbszt <herbszt@xxxxxx> wrote:

> Hello,
> 
> I am wondering how to identify a failed md array.
> Let's assume the following array:
> 
> /dev/md0:
>         Version : 1.2
>   Creation Time : Mon May 26 19:10:59 2014
>      Raid Level : raid1
>      Array Size : 10176 (9.94 MiB 10.42 MB)
>   Used Dev Size : 10176 (9.94 MiB 10.42 MB)
>    Raid Devices : 2
>   Total Devices : 2
>     Persistence : Superblock is persistent
> 
>     Update Time : Mon May 26 19:10:59 2014
>           State : clean
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
> 
>            Name : test:0  (local to host test)
>            UUID : cac8fd48:44219a96:5de7e757:4e21a3e2
>          Events : 17
> 
>     Number   Major   Minor   RaidDevice State
>        0     254        0        0      active sync   /dev/dm-0
>        1     254        1        1      active sync   /dev/dm-1
> 
> with
> 
> /sys/block/md0/md/array_state:clean
> /sys/block/md0/md/dev-dm-0/state:in_sync
> /sys/block/md0/md/dev-dm-1/state:in_sync
> 
> and
> 
> disk0: 0 20480 linear 7:0 0
> disk1: 0 20480 linear 7:1 0
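
For reference, dm mappings with tables like these can be created, and later
switched to the error target, roughly as follows; the disk0/disk1 names and
the loop devices (major 7) come from the table output above, everything else
is an assumption:

  # create two 10 MiB linear mappings over loop devices
  dmsetup create disk0 --table "0 20480 linear 7:0 0"
  dmsetup create disk1 --table "0 20480 linear 7:1 0"

  # swap a mapping for the error target to simulate a dead disk
  dmsetup suspend disk0
  dmsetup load disk0 --table "0 20480 error"
  dmsetup resume disk0
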
> 
> If dm-0 gets changed to "disk0: 0 20480 error" and we read from the
> array (dd if=/dev/md0 count=1 iflag=direct of=/dev/null), the broken
> disk gets detected by md:
> 
> [84688.483607] md/raid1:md0: dm-0: rescheduling sector 0
> [84688.483654] md/raid1:md0: redirecting sector 0 to other mirror: dm-1
> [84688.483670] md: super_written gets error=-5, uptodate=0
> [84688.483672] md/raid1:md0: Disk failure on dm-0, disabling device.
> md/raid1:md0: Operation continuing on 1 devices.
> [84688.483676] md: super_written gets error=-5, uptodate=0
> [84688.494174] RAID1 conf printout:
> [84688.494178]  --- wd:1 rd:2
> [84688.494181]  disk 0, wo:1, o:0, dev:dm-0
> [84688.494182]  disk 1, wo:0, o:1, dev:dm-1
> [84688.494183] RAID1 conf printout:
> [84688.494184]  --- wd:1 rd:2
> [84688.494184]  disk 1, wo:0, o:1, dev:dm-1
> 
> /dev/md0:
>         Version : 1.2
>   Creation Time : Mon May 26 19:10:59 2014
>      Raid Level : raid1
>      Array Size : 10176 (9.94 MiB 10.42 MB)
>   Used Dev Size : 10176 (9.94 MiB 10.42 MB)
>    Raid Devices : 2
>   Total Devices : 2
>     Persistence : Superblock is persistent
> 
>     Update Time : Mon May 26 19:27:41 2014
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
> 
>            Name : test:0  (local to host test)
>            UUID : cac8fd48:44219a96:5de7e757:4e21a3e2
>          Events : 20
> 
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1     254        1        1      active sync   /dev/dm-1
> 
>        0     254        0        -      faulty   /dev/dm-0
> 
> md0 : active raid1 dm-1[1] dm-0[0](F)
>       10176 blocks super 1.2 [2/1] [_U]
> 
> /sys/block/md0/md/array_state:clean
> /sys/block/md0/md/dev-dm-0/state:faulty,write_error
> /sys/block/md0/md/dev-dm-1/state:in_sync
> /sys/block/md0/md/degraded:1
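
The sysfs attributes above can also be checked from a script; a minimal
sketch, assuming the md sysfs layout shown here:

  degraded=$(cat /sys/block/md0/md/degraded)
  raid_disks=$(cat /sys/block/md0/md/raid_disks)
  if [ "$degraded" -gt 0 ]; then
      echo "md0 is degraded: $degraded of $raid_disks members missing"
  fi
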
> 
> However, if I also change dm-1 to "disk1: 0 20480 error" and read
> again, there is no visible state change:
> 
> /dev/md0:
>         Version : 1.2
>   Creation Time : Mon May 26 19:10:59 2014
>      Raid Level : raid1
>      Array Size : 10176 (9.94 MiB 10.42 MB)
>   Used Dev Size : 10176 (9.94 MiB 10.42 MB)
>    Raid Devices : 2
>   Total Devices : 2
>     Persistence : Superblock is persistent
> 
>     Update Time : Mon May 26 19:27:41 2014
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
> 
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1     254        1        1      active sync   /dev/dm-1
> 
>        0     254        0        -      faulty   /dev/dm-0
> 
> md0 : active raid1 dm-1[1] dm-0[0](F)
>       10176 blocks super 1.2 [2/1] [_U]
> 
> /sys/block/md0/md/array_state:clean
> /sys/block/md0/md/dev-dm-0/state:faulty,write_error
> /sys/block/md0/md/dev-dm-1/state:in_sync
> /sys/block/md0/md/degraded:1
> 
> On write to the array we get
> 
> [85498.660247] md: super_written gets error=-5, uptodate=0
> [85498.666464] quiet_error: 268 callbacks suppressed
> [85498.666470] Buffer I/O error on device md0, logical block 2528
> [85498.666476] Buffer I/O error on device md0, logical block 2528
> [85498.666486] Buffer I/O error on device md0, logical block 2542
> [85498.666490] Buffer I/O error on device md0, logical block 2542
> [85498.666496] Buffer I/O error on device md0, logical block 0
> [85498.666499] Buffer I/O error on device md0, logical block 0
> [85498.666508] Buffer I/O error on device md0, logical block 1
> [85498.666512] Buffer I/O error on device md0, logical block 1
> [85498.666518] Buffer I/O error on device md0, logical block 2543
> [85498.666524] Buffer I/O error on device md0, logical block 2543
> [85498.866388] md: super_written gets error=-5, uptodate=0
> 
> and the only change is
> 
> /sys/block/md0/md/dev-dm-1/state:in_sync,write_error,want_replacement
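
One way to surface those per-device flags from a script, based only on the
state files shown in this thread, would be something like:

  for f in /sys/block/md0/md/dev-*/state; do
      grep -qE 'faulty|write_error|want_replacement' "$f" && echo "$f: $(cat "$f")"
  done
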
> 
> How can I identify a failed array?
> array_state reports "clean", the last raid member stays "in_sync" and
> the value in degraded doesn't equal raid_disks.

You know the array is "failed" when you get an IO error.

When a RAID1 array gets down to just one drive remaining, it starts acting
like it is just one drive.
How do you tell if a single, plain, ordinary drive has failed?  You get an IO
error.  Ditto with RAID1.
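
A minimal check along those lines, reusing the direct read from earlier in
the thread (block size and device path are only examples):

  if ! dd if=/dev/md0 of=/dev/null bs=4096 count=1 iflag=direct 2>/dev/null; then
      echo "md0: direct read failed - treat the array as failed"
  fi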

NeilBrown


