Re: [PATCH 0/5] mpt fusion error handler patches

	Hello Bernd & All,

On Fri, 12 Sep 2008, Bernd Schubert wrote:
Hello,
I'm going to submit several error handler patches for the MPT fusion
driver. The purpose of these patches is mainly to fix errors happening
on the second port of dual port 53C1030 based HBAs.
As I complained some time ago on this list, a device failure on one port of
an LSI22320R HBA also causes failures of innocent devices on the other port
of the same HBA. To debug this, Eric Moore sent me a fusion-tip version of
the driver, which we have been using ever since. However, that version has
issues with SAS HBAs and probably won't work with recent kernel versions
either. So I spent quite some time figuring out why the fusion-tip version
(4.x) of the driver doesn't have the issue.

Below I will provide some examples of these issues. Errors on one of the attached
SCSI devices on one of the ports (5 0 4 0) have been simulated using lsiutil.

Unpatched 2.6.26 + a few scsi diagnostics and error handler patches:

[  224.819697] sd 5:0:4:0: last recovery: 4294911483, now: 4294948403
[  224.826142] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
[  224.831676] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 0c 27 2e 98 00 00 04 00 00 00
[  224.842803] sd 5:0:4:0: Activating scsi error recovery (1)
[  224.857824] sd 5:0:4:0: trying to abort command
[  224.865697] mptscsih: ioc1: attempting task abort! (sc=ffff8100f8f10000)
[  224.870572] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 0c 27 2e 98 00 00 04 00 00 00
[  227.047968] mptbase: ioc1: Initiating recovery
[  229.481849] sd 5:0:4:0: mptscsih: ioc1: completing cmds: fw_channel 0, fw_id 4, sc=ffff8100f8fbb180, mf = ffff8100
[...]
[  364.322013] mptscsih: ioc1: bus reset: SUCCESS (sc=ffff8100f8f11b80)
[  371.924342] sd 4:0:2:0: scmd retry 6/6
[  371.928364] sd 4:0:2:0: last recovery: 0, now: 4294985148
[  371.932924] sd 4:0:2:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
[  371.932924] sd 4:0:2:0: [sda] CDB: Write(16): 8a 00 00 00 00 01 31 8b 4a 4e 00 00 00 39 00 00
[  371.932924] sd 4:0:2:0: Activating scsi error recovery (1)
[  371.960382] sd 4:0:2:0: Sending BDR 0xffff81007eaf2538
[  371.984936] sd 4:0:2:0: trying device reset
[  371.989426] mptscsih: ioc0: attempting target reset! (sc=ffff81007eb7c780)

As you can see, target 4 0 2 0, which is on ioc0, suddenly fails as well. In the end:

[  398.596119] sd 4:0:2:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
[  398.605291] end_request: I/O error, dev sda, sector 5126179406
[  398.612360] end_request: I/O error, dev sda, sector 5126179406
[  398.617818]  target4:0:2: Beginning Domain Validation

So the innocent device sda (which is really another device) failed.
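
For anyone who wants to check their own logs for this cross-port effect, here is a minimal sketch in Python. It is not part of the patches; the function name hosts_in_recovery is made up for illustration, and it assumes the log lines look like the "Activating scsi error recovery" messages quoted above. It simply groups the devices that entered error recovery by SCSI host number, so a second host showing up while only one device was faulted hints at the problem described here.

```python
import re

# Matches lines such as:
# [  224.842803] sd 5:0:4:0: Activating scsi error recovery (1)
RECOVERY_RE = re.compile(
    r"sd (\d+):(\d+):(\d+):(\d+): Activating scsi error recovery"
)

def hosts_in_recovery(log_lines):
    """Map SCSI host number -> set of H:C:T:L strings that entered recovery.

    Hypothetical helper for illustration only; the line format is assumed
    from the log excerpts in this thread.
    """
    hosts = {}
    for line in log_lines:
        m = RECOVERY_RE.search(line)
        if m:
            host = int(m.group(1))
            hosts.setdefault(host, set()).add(":".join(m.groups()))
    return hosts

# Three lines taken from the unpatched 2.6.26 log above.
sample = [
    "[  224.842803] sd 5:0:4:0: Activating scsi error recovery (1)",
    "[  364.322013] mptscsih: ioc1: bus reset: SUCCESS (sc=ffff8100f8f11b80)",
    "[  371.932924] sd 4:0:2:0: Activating scsi error recovery (1)",
]
affected = hosts_in_recovery(sample)
for host, devices in sorted(affected.items()):
    print(host, sorted(devices))
```

With the sample lines, both host 5 (the faulted device) and host 4 (the innocent one) are reported.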

Now the same with patches applied, but with the soft reset-handler deactivated:

[  912.861708] sd 5:0:4:0: last recovery: 4295082734, now: 4295120387
[  912.868130] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_
[  912.873757] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
[  912.873757] sd 5:0:4:0: Activating scsi error recovery (2)
[  912.889492] sd 5:0:4:0: trying to abort command
[  912.894118] mptscsih: ioc1: attempting task abort! (sc=ffff8100e361d180)
[  912.900951] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
[  913.025771] mptscsih: ioc1: task abort: FAILED (sc=ffff8100e361d180)
[  913.032269] sd 5:0:4:0: Sending BDR 0xffff8100f99e1428
[  913.040264] sd 5:0:4:0: trying device reset
[  913.044597] mptscsih: ioc1: attempting target reset! (sc=ffff8100e361d180)
[  913.049955] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
[  913.177284] mptscsih: ioc1: target reset: FAILED (sc=ffff8100e361d180)
[  913.181946] Sending BRST chan: 0
[  913.185945] sd 5:0:4:0: trying bus reset
[  913.189974] mptscsih: ioc1: attempting bus reset! (sc=ffff8100e361d180)
[  913.197310] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
[  913.325079] mptscsih: ioc1: bus reset: FAILED (sc=ffff8100e361d180)
[  913.329668] sd 5:0:4:0: trying host reset
[  913.333864] mptscsih: ioc1: attempting host reset! (sc=ffff8100e361d180)
[  913.341832] mptscsih: ioc1: Skipping hard reset in order to prevent failures on ioc
[  913.349821] mptscsih: ioc1: host reset: FAILED (sc=ffff8100e361d180)
[  913.356704] sd 5:0:4:0: Device offlined - not ready after error recovery
[  913.363199] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK

=> The device was not recovered, but at least 4 0 2 0 didn't fail :)
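
The escalation visible in this log (task abort, then target reset, then bus reset, then host reset) can be pulled out of a log with a few lines of Python. This is only a sketch under the assumption that the result lines have the "mptscsih: iocN: <step>: SUCCESS|FAILED" shape seen above; the function name escalation_outcomes is hypothetical and not anything from the driver or the patches.

```python
import re

# Matches result lines such as:
# [  913.025771] mptscsih: ioc1: task abort: FAILED (sc=ffff8100e361d180)
# The "attempting ..." lines do not match, since they carry no result.
OUTCOME_RE = re.compile(
    r"mptscsih: (ioc\d+): "
    r"(task abort|target reset|bus reset|host reset): "
    r"(SUCCESS|FAILED)"
)

def escalation_outcomes(log_lines):
    """Return (ioc, step, result) tuples in the order they appear.

    Hypothetical helper for illustration; the line format is assumed
    from the log excerpts in this thread.
    """
    outcomes = []
    for line in log_lines:
        m = OUTCOME_RE.search(line)
        if m:
            outcomes.append(m.groups())
    return outcomes

# Lines taken from the log above (patches applied, soft reset handler off).
sample_log = [
    "[  912.894118] mptscsih: ioc1: attempting task abort! (sc=ffff8100e361d180)",
    "[  913.025771] mptscsih: ioc1: task abort: FAILED (sc=ffff8100e361d180)",
    "[  913.177284] mptscsih: ioc1: target reset: FAILED (sc=ffff8100e361d180)",
    "[  913.325079] mptscsih: ioc1: bus reset: FAILED (sc=ffff8100e361d180)",
    "[  913.349821] mptscsih: ioc1: host reset: FAILED (sc=ffff8100e361d180)",
]
steps = escalation_outcomes(sample_log)
for ioc, step, result in steps:
    print(f"{ioc}: {step} -> {result}")
```

Every step is reported for ioc1 only, which matches the observation that 4 0 2 0 on ioc0 stayed out of it.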

Now with all patches applied:

[  214.903699] sd 5:0:4:0: last recovery: 0, now: 4294945953
[  214.910652] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
[  214.918652] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
[  214.918652] sd 5:0:4:0: Activating scsi error recovery (1)
[  214.934655] sd 5:0:4:0: trying to abort command
[  214.939581] mptscsih: ioc1: attempting task abort! (sc=ffff8100f9be0c80)
[  214.947581] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
[  215.077430] mptscsih: ioc1: task abort: FAILED (sc=ffff8100f9be0c80)
[  215.083645] sd 5:0:4:0: Sending BDR 0xffff81007eb51428
[  215.090298] sd 5:0:4:0: trying device reset
[  215.094810] mptscsih: ioc1: attempting target reset! (sc=ffff8100f9be0c80)
[  215.101917] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
[  215.229659] mptscsih: ioc1: target reset: FAILED (sc=ffff8100f9be0c80)
[  215.236367] Sending BRST chan: 0
[  215.240173] sd 5:0:4:0: trying bus reset
[  215.244313] mptscsih: ioc1: attempting bus reset! (sc=ffff8100f9be0c80)
[  215.251731] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
[  215.382449] mptscsih: ioc1: bus reset: FAILED (sc=ffff8100f9be0c80)
[  215.388946] sd 5:0:4:0: trying host reset
[  215.393162] mptscsih: ioc1: attempting host reset! (sc=ffff8100f9be0c80)
[  215.400489] sd 5:0:4:0: mptscsih: ioc1: completing cmds: fw_channel 0, fw_id 4, sc=ffff8100f9be0c80, mf = ffff8105
[  217.317914] mptbase: ioc1: SoftResetHandler: completed (1 seconds): SUCCESS
[  217.324924] mptscsih: ioc1: host reset: SUCCESS (sc=ffff8100f9be0c80)
[  227.546452]  target5:0:4: Beginning Domain Validation
[  227.578775]  target5:0:4: Ending Domain Validation
[  227.584099]  target5:0:4: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS PCOMP (6.25 ns, offset 127)
[  227.596959]  target5:0:5: Beginning Domain Validation
[  227.651196]  target5:0:5: Ending Domain Validation
[  227.656977]  target5:0:5: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS PCOMP (6.25 ns, offset 127)

Thank you, Bernd, for tracking this down. I have run into this very issue every time one of my drives starts going bad on me & I reboot to see whether the errors are actually just bad firmware that requires a power reset to clear up. But when it is a truly failing drive (or I am testing a new type of cabling that turns out to be inferior), these rolling resets of the controller -> channel -> device itself have caused me no end of impatient waiting for them to end. So far they do eventually end. In some earlier driver versions they did NOT time out. That usually told me that something was amiss, & I had to go hit-and-miss, changing things out to find the actual culprit. There always was a culprit in the chain someplace.

I would also like to thank "The LSI team" for creating this in-kernel (& module) driver for their line of fusion cards (& fixing atto's as well), as well as for maintaining it & putting up with my pissing and moaning about this and some other issues that had cropped up.

	Again, thank you all,  JimL
--
+------------------------------------------------------------------+
| James   W.   Laferriere | System    Techniques | Give me VMS     |
| Network&System Engineer | 2133    McCullam Ave |  Give me Linux  |
| babydr@xxxxxxxxxxxxxxxx | Fairbanks, AK. 99701 |   only  on  AXP |
+------------------------------------------------------------------+