On Tue, 3 Jul 2012 16:12:00 +0900 Christian Balzer <chibi@xxxxxxx> wrote:

> On Tue, 3 Jul 2012 16:45:28 +1000 NeilBrown wrote:
>
> > On Tue, 3 Jul 2012 15:10:38 +0900 Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > > On Tue, 3 Jul 2012 15:50:45 +1000 NeilBrown wrote:
> > > > [snip]
> > > > That took *way* too long to find given how simple the fix is.
> > >
> > > Well, given how long it takes with some OSS projects, I'd say 4 days is
> > > pretty good. ^o^
> >
> > I meant the 4 hours of my time searching, not the 4 days of your time
> > waiting :-)
> >
> Hehehe, if you put it that way... ^o^
>
> > >
> > > > I spent ages staring at the code, and was about to reply "no idea"
> > > > when I thought I should test it myself. The test failed immediately.
> > >
> > > Could you elaborate a bit?
> > > As in, was this something introduced only very recently, since I had
> > > dozens of disks fail before w/o any such pyrotechnics.
> > > Or were there some special circumstances that triggered it?
> > > (But looking at the patch, I guess it should have been pretty
> > > universal.)
> >
> > The bug was introduced by commit 58c54fcca3bac5bf9, which first appeared
> > in Linux 3.1. Since then, any read error on RAID10 will trigger the bug.
> >
> Ouch, that's a pretty substantial number of machines, I'd reckon.

Could be. But they all seem to have very reliable disks. Except yours :-)

> But now I'm even more intrigued: how do you (or the md code) define a read
> error then?

The obvious way, I guess.

> Remember this beauty here, which triggered the hunt and kill of the R10
> recovery bug of uneven member sets?

Looks like that was a write error. They are handled quite differently.

NeilBrown

> ---
> Jun 20 18:22:01 borg03b kernel: [1383357.792044] mptscsih: ioc0: attempting task abort! (sc=ffff88023c3c5180)
> Jun 20 18:22:01 borg03b kernel: [1383357.792049] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
> Jun 20 18:22:06 borg03b kernel: [1383362.317346] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
> Jun 20 18:22:06 borg03b kernel: [1383362.317589] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.567292] mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000) cb_idx mptscsih_io_done
> Jun 20 18:22:06 borg03b kernel: [1383362.567316] mptscsih: ioc0: attempting target reset! (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.567321] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
> Jun 20 18:22:06 borg03b kernel: [1383362.568040] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.568068] mptscsih: ioc0: attempting host reset! (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.568074] mptbase: ioc0: Initiating recovery
> Jun 20 18:22:29 borg03b kernel: [1383385.440045] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88023c3c5180)
> Jun 20 18:22:29 borg03b kernel: [1383385.484846] Device returned, unsetting inDMD
> Jun 20 18:22:39 borg03b kernel: [1383395.448043] sd 8:0:3:0: Device offlined - not ready after error recovery
> Jun 20 18:22:39 borg03b kernel: [1383395.448135] sd 8:0:3:0: rejecting I/O to offline device
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] end_request: I/O error, dev sdh, sector 71
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md: super_written gets error=-5, uptodate=0
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Disk failure on sdh1, disabling device.
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Operation continuing on 4 devices.
> ---
> That was a 3.2.18 kernel, but it didn't die, and neither did the other
> cluster member with a very similar failure two weeks earlier.
>
> So I guess the device getting kicked out by the libata layer below is
> fine, but it returning medium errors triggers the bug?
>
> Anyways, time to patch stuff; thankfully this is the only production
> cluster I have with a 3.2 kernel using RAID10. ^.^;
>
> Regards,
>
> Christian
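For anyone triaging a fleet after reading this thread, a quick shell sketch of the version check follows. The thread only states that the bug first appeared in Linux 3.1 and affects RAID10 read-error handling, so the cutoff below treats anything 3.1 or newer as suspect; the function name is hypothetical, and you would still need to confirm against your distro's changelog whether a given kernel already carries the fix.

```shell
# Sketch: flag kernel versions in the range the thread describes
# (RAID10 read-error bug introduced in 3.1).  "possibly affected"
# means "3.1 or later, verify whether the fix is applied", not a
# definitive verdict.
raid10_read_error_suspect() {
    ver=$1
    major=${ver%%.*}          # part before the first dot
    rest=${ver#*.}            # everything after the first dot
    minor=${rest%%.*}         # second version component
    minor=${minor%%[!0-9]*}   # strip suffixes like "-5-amd64"
    if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 1 ]; }; then
        echo "possibly affected"
    else
        echo "not affected"
    fi
}

raid10_read_error_suspect "$(uname -r)"
```

Per the thread, 3.2.18 falls in the suspect range, while a pre-3.1 kernel such as 2.6.32 does not.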