Hi Everyone,
I have a RAID 5 array built on a few disks, and at some point a disk gets pulled from the enclosure, which results in a disk failure.
When only reads are running on the RAID 5 array, the time between the actual physical removal of the disk and md failing that disk is about 4 seconds on my system.
I am trying, with some scripting, to shorten those 4 seconds.
Since it presumably takes 4 seconds to fail a disk because md needs that much time to determine for sure that the disk has really failed or is really gone, I instead use the LSI SAS MPT Fusion driver to detect when the physical disk phy has gone offline (which happens in near real time), and from there I issue a "./mdadm --fail /dev/md/dX /dev/sdX" command immediately after detecting the failure.
This is much faster than waiting for all the timeouts in sd/md.
It is not dangerous either: once a disk phy has gone offline, the disk is physically gone, so there is no need for any timeouts or retries.
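For reference, here is a simplified sketch of what my script does. The array name, member device, and SAS address are hard-coded here purely for illustration (in reality the mapping from the reported phy/SAS address to /dev/sdX comes from my setup), and it assumes the kernel messages end up in /var/log/kern.log via syslog:

#!/bin/bash
# Simplified sketch only: watch for the MPT Fusion "phy offline" message and
# fail the corresponding md member immediately.
# ARRAY, MEMBER and SAS_ADDR are placeholders for my actual configuration.

ARRAY=/dev/md/d0
MEMBER=/dev/sdg2
SAS_ADDR=0x50015b22300009e0

tail -F /var/log/kern.log | while read -r line; do
    case "$line" in
        *"sas addr: ${SAS_ADDR} is now offline"*)
            # The phy is physically gone, no point waiting for sd/md timeouts.
            mdadm --fail "$ARRAY" "$MEMBER"
            break
            ;;
    esac
done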
All of the above works fine and would solve my problem of speeding up the failing of a disk, except that when I send the --fail command immediately after the disk has been removed from the enclosure, the md array seems to get stuck looping through recovery passes instead of reconfiguring itself as a degraded array.
Here is a copy of dmesg showing the phenomenon:
[ 744.631035] ioc2 Event: 0xf
[ 745.452060] ioc2 Event: SAS_DISCOVERY
[ 745.465349] ioc2: Phy 13 Handle a sas addr: 0x50015b22300009e0 is now
offline
[ 745.485209] ioc2 Event: SAS_DISCOVERY
[ 745.693492] raid5: Disk failure on sdg2, disabling device.
[ 745.693497] raid5: Operation continuing on 4 devices.
[ 745.718787] md: recovery of RAID array md_d0
[ 745.723047] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 745.728862] md: using maximum available idle IO bandwidth (but not
more than 200000 KB/sec) for recovery.
[ 745.738402] md: using 2048k window, over a total of 9429760 blocks.
[ 745.744655] md: resuming recovery of md_d0 from checkpoint.
[ 745.756660] md: md_d0: recovery done.
[ 745.761900] RAID5 conf printout:
[ 745.765148] --- rd:5 wd:4
[ 745.767855] disk 0, o:1, dev:sdd2
[ 745.771243] disk 1, o:1, dev:sda2
[ 745.774634] disk 2, o:1, dev:sdc2
[ 745.778029] disk 3, o:0, dev:sdg2
[ 745.781424] disk 4, o:1, dev:sdb2
[ 745.799202] md: recovery of RAID array md_d0
[ 745.803460] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 745.809275] md: using maximum available idle IO bandwidth (but not
more than 200000 KB/sec) for recovery.
[ 745.818814] md: using 2048k window, over a total of 9429760 blocks.
[ 745.825066] md: resuming recovery of md_d0 from checkpoint.
[ 745.837066] md: md_d0: recovery done.
[ 745.842872] RAID5 conf printout:
[ 745.846187] --- rd:5 wd:4
[ 745.848890] disk 0, o:1, dev:sdd2
[ 745.852281] disk 1, o:1, dev:sda2
[ 745.855673] disk 2, o:1, dev:sdc2
[ 745.859064] disk 3, o:0, dev:sdg2
[ 745.862459] disk 4, o:1, dev:sdb2
[ 745.919801] md: recovery of RAID array md_d0
[ 745.924090] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 745.929949] md: using maximum available idle IO bandwidth (but not
more than 200000 KB/sec) for recovery.
[ 745.939491] md: using 2048k window, over a total of 9429760 blocks.
[ 745.945742] md: resuming recovery of md_d0 from checkpoint.
[ 745.957745] md: md_d0: recovery done.
[ 746.149794] RAID5 conf printout:
[ 746.153051] --- rd:5 wd:4
[ 746.155752] disk 0, o:1, dev:sdd2
[ 746.159151] disk 1, o:1, dev:sda2
[ 746.162543] disk 2, o:1, dev:sdc2
[ 746.165939] disk 3, o:0, dev:sdg2
[ 746.169334] disk 4, o:1, dev:sdb2
[ 746.369081] md: cannot remove active disk sdg2 from md_d0 ...
[ 746.374866] md: recovery of RAID array md_d0
[ 746.379129] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 746.384949] md: using maximum available idle IO bandwidth (but not
more than 200000 KB/sec) for recovery.
[ 746.394489] md: using 2048k window, over a total of 9429760 blocks.
[ 746.400794] md: resuming recovery of md_d0 from checkpoint.
[ 746.412814] md: md_d0: recovery done.
[ 746.436268] md: cannot remove active disk sdg2 from md_d0 ...
[ 746.491071] RAID5 conf printout:
[ 746.494321] --- rd:5 wd:4
[ 746.497027] disk 0, o:1, dev:sdd2
[ 746.500420] disk 1, o:1, dev:sda2
[ 746.503811] disk 2, o:1, dev:sdc2
[ 746.507202] disk 3, o:0, dev:sdg2
[ 746.510598] disk 4, o:1, dev:sdb2
[ 746.594835] md: recovery of RAID array md_d0
[ 746.599097] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 746.604915] md: using maximum available idle IO bandwidth (but not
more than 200000 KB/sec) for recovery.
[ 746.614453] md: using 2048k window, over a total of 9429760 blocks.
[ 746.620698] md: resuming recovery of md_d0 from checkpoint.
[ 746.632689] md: md_d0: recovery done.
[ 746.683892] RAID5 conf printout:
[ 746.687118] --- rd:5 wd:4
[ 746.689825] disk 0, o:1, dev:sdd2
[ 746.693224] disk 1, o:1, dev:sda2
[ 746.696617] disk 2, o:1, dev:sdc2
[ 746.700012] disk 3, o:0, dev:sdg2
[ 746.703404] disk 4, o:1, dev:sdb2
[ 746.733530] md: recovery of RAID array md_d0
[ 746.737820] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 746.743635] md: using maximum available idle IO bandwidth (but not
more than 200000 KB/sec) for recovery.
[ 746.753171] md: using 2048k window, over a total of 9429760 blocks.
[ 746.759419] md: resuming recovery of md_d0 from checkpoint.
[ 746.771432] md: md_d0: recovery done.
[ 746.821801] RAID5 conf printout:
[ 746.825049] --- rd:5 wd:4
[ 746.827754] disk 0, o:1, dev:sdd2
[ 746.831150] disk 1, o:1, dev:sda2
[ 746.834545] disk 2, o:1, dev:sdc2
[ 746.837940] disk 3, o:0, dev:sdg2
[ 746.841336] disk 4, o:1, dev:sdb2
[ 746.853983] md: recovery of RAID array md_d0
[ 746.858270] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 746.864088] md: using maximum available idle IO bandwidth (but not
more than 200000 KB/sec) for recovery.
[ 746.873627] md: using 2048k window, over a total of 9429760 blocks.
[ 746.879872] md: resuming recovery of md_d0 from checkpoint.
[ 746.891885] md: md_d0: recovery done.
[ 746.942047] RAID5 conf printout:
[ 746.945301] --- rd:5 wd:4
[ 746.948008] disk 0, o:1, dev:sdd2
[ 746.951398] disk 1, o:1, dev:sda2
[ 746.954788] disk 2, o:1, dev:sdc2
[ 746.958183] disk 3, o:0, dev:sdg2
[ 746.961579] disk 4, o:1, dev:sdb2
[ 746.974231] md: recovery of RAID array md_d0
[ 746.978521] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 746.984339] md: using maximum available idle IO bandwidth (but not
more than 200000 KB/sec) for recovery.
[ 746.993887] md: using 2048k window, over a total of 9429760 blocks.
[ 747.000136] md: resuming recovery of md_d0 from checkpoint.
[ 747.012149] md: md_d0: recovery done.
[ 747.062726] RAID5 conf printout:
[ 747.065976] --- rd:5 wd:4
[ 747.068683] disk 0, o:1, dev:sdd2
[ 747.072075] disk 1, o:1, dev:sda2
[ 747.075468] disk 2, o:1, dev:sdc2
[ 747.078859] disk 3, o:0, dev:sdg2
[ 747.082253] disk 4, o:1, dev:sdb2
[ 747.202204] md: recovery of RAID array md_d0
[ 747.206499] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 747.212317] md: using maximum available idle IO bandwidth (but not
more than 200000 KB/sec) for recovery.
[ 747.221855] md: using 2048k window, over a total of 9429760 blocks.
[ 747.228107] md: resuming recovery of md_d0 from checkpoint.
[ 747.240122] md: md_d0: recovery done.
[ 747.291282] RAID5 conf printout:
[ 747.294530] --- rd:5 wd:4
[ 747.297238] disk 0, o:1, dev:sdd2
[ 747.300636] disk 1, o:1, dev:sda2
[ 747.304026] disk 2, o:1, dev:sdc2
[ 747.307418] disk 3, o:0, dev:sdg2
...
This continues for precisely 4 seconds, the same time it takes when the --fail command is not sent at all and md fails the disk on its own, at which point the sd device is kicked out and the array becomes degraded (or md sees too many IO errors and the same thing happens).
On the other hand, the interesting point is that if I do the same thing while read IOs are running but without physically pulling a disk (i.e. issuing --fail on a disk that is present, healthy, and serving the read IOs), everything works fine.
It seems that if a --fail command is sent for a disk that is currently going through the sd or md error-checking and/or recovery path, the command is not ignored but instead triggers an endless loop of recoveries (in the case above the recoveries are very fast since only read IOs are running), and it only takes effect once the normal error handling has completed and decided to expel or kick the device out.
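In case it helps anyone reproduce or observe this, the following is roughly what I use to watch what md thinks of the member while the loop is running (the sysfs paths assume the array shows up as md_d0 and the failed member is sdg2, as in my case):

# Watch the member state and sync_action while the loop is happening.
watch -n 0.2 '
  cat /proc/mdstat
  echo
  echo -n "dev-sdg2 state: "; cat /sys/block/md_d0/md/dev-sdg2/state
  echo -n "sync_action:    "; cat /sys/block/md_d0/md/sync_action
'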
Would anybody know what is causing this, and whether there is a way around it?
The command used to create the RAID is the following:
"mdadm --create -vvv --force --run --metadata=1.2 /dev/md/d0 --level=5
--size=9429760 --chunk=64 --name=test_01 -n5 --bitmap=internal
--bitmap-chunk=4096 --layout=ls /dev/sdd2 /dev/sda2 /dev/sdc2 /dev/sdg2
/dev/sdb2"
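For what it is worth, I check the result of the create afterwards with, for example:

mdadm --detail /dev/md/d0
cat /proc/mdstat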
Thank you very much in advance!
Ben.