Re: Fatal crash/hang in scsi_lib after RAID disk failure

On Tue, 3 Jul 2012 16:12:00 +0900 Christian Balzer <chibi@xxxxxxx> wrote:

> On Tue, 3 Jul 2012 16:45:28 +1000 NeilBrown wrote:
> 
> > On Tue, 3 Jul 2012 15:10:38 +0900 Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > > On Tue, 3 Jul 2012 15:50:45 +1000 NeilBrown wrote:
> > > 
> [snip]
> > > > That took *way* too long to find given how simple the fix is.
> > > 
> > > Well, given how long it takes with some OSS projects, I'd say 4 days is
> > > pretty good. ^o^
> > 
> > I meant the 4 hours of my time searching, not the 4 days of your time
> > waiting :-)
> > 
> Hehehe, if you put it that way... ^o^
> 
> > 
> > > 
> > > > I spent ages staring at the code, was about to reply and say "no idea"
> > > > when I thought I should test it myself.  Test failed immediately.
> > > 
> > > Could you elaborate a bit? 
> > > As in, was this something introduced only very recently, since I had
> > > dozens of disks fail before w/o any such pyrotechnics. 
> > > Or were there some special circumstances that triggered it? 
> > > (But looking at the patch, I guess it should have been pretty
> > > universal)
> > 
> > Bug was introduced by commit 58c54fcca3bac5bf9 which first appeared in
> > Linux 3.1.  Since then, any read error on RAID10 will trigger the bug.
> > 
> Ouch, that's a pretty substantial number of machines I'd reckon.

Could be.  But they all seem to have very reliable disks.  Except yours :-)
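
If you want a rough way to spot which machines might be exposed, checking the
running kernel against 3.1-or-later and looking for raid10 arrays in
/proc/mdstat gets you most of the way.  A minimal sketch follows; the 3.1
cut-off comes from this thread, and whether a given stable kernel already
carries the fix still has to be checked against its changelog, so treat a
"check" result as a prompt to look, not a verdict.

/* Sketch: flag machines that may be exposed to the RAID10 read-error bug.
 * Only reports "kernel >= 3.1 and raid10 arrays present"; it does not know
 * whether the running kernel already includes the fix.
 */
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname u;
    char line[512];
    int major = 0, minor = 0, have_raid10 = 0;
    FILE *f;

    if (uname(&u) != 0)
        return 1;
    sscanf(u.release, "%d.%d", &major, &minor);

    f = fopen("/proc/mdstat", "r");
    if (f) {
        while (fgets(line, sizeof(line), f))
            if (strstr(line, "raid10"))
                have_raid10 = 1;
        fclose(f);
    }

    if ((major > 3 || (major == 3 && minor >= 1)) && have_raid10)
        printf("kernel %s with raid10 arrays: check for the fix\n", u.release);
    else
        printf("probably not affected\n");
    return 0;
}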

> 
> But now I'm even more intrigued: how do you (or the md code) define a read
> error then?

The obvious way I guess.

> Remember this beauty here, which triggered the hunt and kill of the R10
> recovery bug of uneven member sets?

Looks like that was a write error.  They are handled quite differently.
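
Roughly: a member that fails a write gets failed straight away, while a read
error is first repaired by re-reading the data from another mirror and writing
it back over the bad sector, and the member is only failed if that repair
itself fails.  Below is a sketch of that decision flow, with made-up helper
names rather than the real raid10.c symbols, just to show the shape of it.

/* Illustrative sketch only -- not the md/raid10 kernel code.  The type and
 * function names are invented; the point is the asymmetry between the two
 * error paths. */
enum io_dir { IO_READ, IO_WRITE };

struct member { int faulty; };

/* assumed helpers, not real kernel symbols */
extern int reread_from_other_mirror(struct member *m, long sector);
extern int rewrite_bad_sector(struct member *m, long sector);
extern void fail_member(struct member *m);

void handle_member_error(struct member *m, enum io_dir dir, long sector)
{
    if (dir == IO_WRITE) {
        /* a write the member could not complete: kick it out immediately */
        fail_member(m);
        return;
    }

    /* read error: the data still exists on another mirror, so fetch it
     * from there and try to write it back over the bad sector */
    if (reread_from_other_mirror(m, sector) == 0 &&
        rewrite_bad_sector(m, sector) == 0)
        return;                 /* repaired, member stays in the array */

    fail_member(m);             /* repair failed: now fail the member */
}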

NeilBrown


> ---
> Jun 20 18:22:01 borg03b kernel: [1383357.792044] mptscsih: ioc0: attempting task abort! (sc=ffff88023c3c5180)
> Jun 20 18:22:01 borg03b kernel: [1383357.792049] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
> Jun 20 18:22:06 borg03b kernel: [1383362.317346] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
> Jun 20 18:22:06 borg03b kernel: [1383362.317589] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.567292] mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000) cb_idx mptscsih_io_done
> Jun 20 18:22:06 borg03b kernel: [1383362.567316] mptscsih: ioc0: attempting target reset! (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.567321] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
> Jun 20 18:22:06 borg03b kernel: [1383362.568040] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.568068] mptscsih: ioc0: attempting host reset! (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.568074] mptbase: ioc0: Initiating recovery
> Jun 20 18:22:29 borg03b kernel: [1383385.440045] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88023c3c5180)
> Jun 20 18:22:29 borg03b kernel: [1383385.484846] Device returned, unsetting inDMD
> Jun 20 18:22:39 borg03b kernel: [1383395.448043] sd 8:0:3:0: Device offlined - not ready after error recovery
> Jun 20 18:22:39 borg03b kernel: [1383395.448135] sd 8:0:3:0: rejecting I/O to offline device
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] end_request: I/O error, dev sdh, sector 71
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md: super_written gets error=-5, uptodate=0
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Disk failure on sdh1, disabling device.
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Operation continuing on 4 devices.
> ---
> That was a 3.2.18 kernel, but it didn't die and neither did the other
> cluster member with a very similar failure two weeks earlier. 
> 
> So I guess the device getting kicked out by the libata layer below is
> fine, but it returning medium errors triggers the bug?
> 
> Anyways, time to patch stuff; thankfully this is the only production
> cluster I have with a 3.2 kernel using RAID10. ^.^;
> 
> Regards,
> 
> Christian

