[Bug 42765] New: mptscsih driver issues task aborts during high write utilization

bugzilla-daemon@xxxxxxxxxxxxxxxxxxx · Mon, 13 Feb 2012 13:36:51 GMT

https://bugzilla.kernel.org/show_bug.cgi?id=42765

           Summary: mptscsih driver issues task aborts during high write
                    utilization
           Product: SCSI Drivers
           Version: 2.5
    Kernel Version: 2.6.38-8
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: blocking
          Priority: P1
         Component: Other
        AssignedTo: scsi_drivers-other@xxxxxxxxxxxxxxxxxxxx
        ReportedBy: eric.hidle@xxxxxxxxx
        Regression: No

Created an attachment (id=72364)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=72364)
Graph of total write rate in guest during rsync data restore + mdadm rebuild
(high write utilization)

During high write utilization, a string of errors similar to the following
appears in syslog:

Feb 13 07:42:22 Beluga kernel: [54224.040144] mptscsih: ioc1: attempting task
abort! (sc=ffff8800018dea00)
Feb 13 07:42:22 Beluga kernel: [54224.040154] sd 3:0:5:0: [sdg] CDB: Write(10):
2a 00 47 0e 55 3f 00 02 00 00
Feb 13 07:42:23 Beluga kernel: [54224.779457] mptscsih: ioc1: task abort:
SUCCESS (rv=2002) (sc=ffff8800018dea00) (sn=53380191)
Feb 13 07:42:24 Beluga kernel: [54226.018680] mptscsih: ioc1: attempting task
abort! (sc=ffff88001737a300)
Feb 13 07:42:24 Beluga kernel: [54226.018699] sd 3:0:5:0: [sdg] CDB: Write(10):
2a 00 47 0e 50 3f 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.018711] mptscsih: ioc1: task abort:
SUCCESS (rv=2002) (sc=ffff88001737a300) (sn=53380197)
Feb 13 07:42:24 Beluga kernel: [54226.025368] mptscsih: ioc1: attempting task
abort! (sc=ffff88002b2c7200)
Feb 13 07:42:24 Beluga kernel: [54226.025372] sd 3:0:5:0: [sdg] CDB: Write(10):
2a 00 47 0e 50 47 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.025382] mptscsih: ioc1: task abort:
SUCCESS (rv=2002) (sc=ffff88002b2c7200) (sn=53380199)
Feb 13 07:42:24 Beluga kernel: [54226.025556] mptscsih: ioc1: attempting task
abort! (sc=ffff88002b3b3300)
Feb 13 07:42:24 Beluga kernel: [54226.025559] sd 3:0:5:0: [sdg] CDB: Write(10):
2a 00 47 0e 50 4f 00 00 60 00
Feb 13 07:42:24 Beluga kernel: [54226.025569] mptscsih: ioc1: task abort:
SUCCESS (rv=2002) (sc=ffff88002b3b3300) (sn=53380205)
Feb 13 07:42:24 Beluga kernel: [54226.025737] mptscsih: ioc1: attempting task
abort! (sc=ffff88002b2c7900)
Feb 13 07:42:24 Beluga kernel: [54226.025740] sd 3:0:5:0: [sdg] CDB: Write(10):
2a 00 47 0e 57 3f 00 01 f0 00
Feb 13 07:42:24 Beluga kernel: [54226.025749] mptscsih: ioc1: task abort:
SUCCESS (rv=2002) (sc=ffff88002b2c7900) (sn=53380211)
Feb 13 07:42:24 Beluga kernel: [54226.025916] mptscsih: ioc1: attempting task
abort! (sc=ffff88002b385e00)
Feb 13 07:42:24 Beluga kernel: [54226.025919] sd 3:0:5:0: [sdg] CDB: Write(10):
2a 00 47 0e 59 2f 00 03 10 00
Feb 13 07:42:24 Beluga kernel: [54226.025928] mptscsih: ioc1: task abort:
SUCCESS (rv=2002) (sc=ffff88002b385e00) (sn=53380217)
Feb 13 07:42:24 Beluga kernel: [54226.026094] mptscsih: ioc1: attempting task
abort! (sc=ffff88002b385b00)
Feb 13 07:42:24 Beluga kernel: [54226.026098] sd 3:0:5:0: [sdg] CDB: Write(10):
2a 00 2f 7f ad bf 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.026107] mptscsih: ioc1: task abort:
SUCCESS (rv=2002) (sc=ffff88002b385b00) (sn=53380218)
Feb 13 07:42:24 Beluga kernel: [54226.026271] mptscsih: ioc1: attempting task
abort! (sc=ffff88001737ad00)
Feb 13 07:42:24 Beluga kernel: [54226.026274] sd 3:0:5:0: [sdg] CDB: Write(10):
2a 00 2f 7f ad c7 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.026283] mptscsih: ioc1: task abort:
SUCCESS (rv=2002) (sc=ffff88001737ad00) (sn=53380224)
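
For anyone comparing the aborted commands between logs, here is a minimal
sketch (not part of the original report) that decodes the Write(10) CDBs
printed above into a starting LBA and a transfer length. The function name
decode_write10 is made up for illustration; the byte layout is the standard
Write(10) format (opcode 0x2a, LBA in bytes 2-5, length in bytes 7-8).

```python
def decode_write10(cdb_hex):
    """Decode a Write(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10:
        raise ValueError("Write(10) CDB must be exactly 10 bytes")
    if cdb[0] != 0x2A:
        raise ValueError("opcode 0x%02x is not Write(10)" % cdb[0])
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    # A transfer length of 0 means no blocks are written, so there is no last LBA.
    last_lba = lba + blocks - 1 if blocks else None
    return {"lba": lba, "blocks": blocks, "last_lba": last_lba}


if __name__ == "__main__":
    # First aborted command from the guest syslog above.
    info = decode_write10("2a 00 47 0e 55 3f 00 02 00 00")
    print("lba=0x%08x blocks=0x%x last_lba=0x%08x"
          % (info["lba"], info["blocks"], info["last_lba"]))
```

Decoded this way, the first guest write starts at LBA 0x470e553f and covers
0x200 blocks, ending at 0x470e573e. The three aborted writes in the VMware
kernel log below (LBA 0x470e55bf, 0x470e563f and 0x470e56bf, 0x80 blocks each)
all fall inside that range, which would be consistent with the host splitting
one guest request into smaller commands, although the logs alone do not prove
this.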

At the same time as this output appears in the Linux guest syslog, the
following messages appear in the VMware kernel log:

2012-02-13T12:42:21.677Z cpu6:65683)<6>mptscsih: ioc0: attempting task abort!
(sc=0x4124015017c0)
2012-02-13T12:42:21.677Z cpu6:65683)MPT SAS Host:8:0:4:0 ::
<6>        command: Write(10): 2a 00 47 0e 55 bf 00 00 80 00
2012-02-13T12:42:22.141Z cpu1:2049)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a
(0x412400728580) to dev "naa.50024e92063340f2" on path "vmhba3:C0:T4:L0"
Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.Act:EVAL
2012-02-13T12:42:22.141Z cpu1:2049)WARNING: NMP:
nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.50024e92063340f2" state in
doubt; requested fast path state update...
2012-02-13T12:42:22.141Z cpu1:2049)ScsiDeviceIO: 2305: Cmd(0x412400728580)
0x2a, CmdSN 0x800e0069 to dev "naa.50024e92063340f2" failed H:0x8 D:0x0 P:0x0
Possible sense data: 0x0 0x0 0x0.
2012-02-13T12:42:22.141Z cpu1:2049)<6>mptbase: ioc0: LogInfo(0x31140000):
Originator={PL}, Code={IO Executed}, SubCode(0x0000)
2012-02-13T12:42:22.142Z cpu3:100554)WARNING: LinScsi:
SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1055 Host Busy
vmhba3:0:4:0 (driver name: MPT SAS Host) - Message repeated 1 time
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: task abort: SUCCESS
(sc=0x4124015017c0)
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: attempting task abort!
(sc=0x4124014d6380)
2012-02-13T12:42:22.142Z cpu6:65683)MPT SAS Host:8:0:4:0 ::
<6>        command: Write(10): 2a 00 47 0e 56 3f 00 00 80 00
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: task abort: SUCCESS
(sc=0x4124014d6380)
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: attempting task abort!
(sc=0x41240141ba80)
2012-02-13T12:42:22.142Z cpu6:65683)MPT SAS Host:8:0:4:0 ::
<6>        command: Write(10): 2a 00 47 0e 56 bf 00 00 80 00
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: task abort: SUCCESS
(sc=0x41240141ba80)
2012-02-13T12:42:23.397Z cpu3:2171)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x0
(0x412400720d40) to dev "naa.50024e92063340f2" on path "vmhba3:C0:T4:L0"
Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.Act:NONE
2012-02-13T12:42:23.397Z cpu3:2171)ScsiDeviceIO: 2305: Cmd(0x412400720d40) 0x0,
CmdSN 0x800e0061 to dev "naa.50024e92063340f2" failed H:0x0 D:0x2 P:0x0 Valid
sense data: 0x6 0x29 0x0.
2012-02-13T12:42:23.397Z cpu3:2171)ScsiCore: 1455: Power-on Reset occurred on
naa.50024e92063340f2
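
As a reading aid for the "H:... D:... P:... sense data" fields above, the
following is a rough sketch, not an official tool. The function name
decode_status and the lookup tables are illustrative and deliberately
incomplete. The device status and sense values follow the SCSI standards; the
host status names are my understanding of VMware's published host status
codes and should be treated as an assumption.

```python
import re

# Host status names as I understand VMware's published codes (assumption).
HOST_STATUS = {0x0: "OK", 0x5: "ABORT", 0x8: "RESET"}
# SCSI device status and sense key values from the SCSI standards (partial).
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEY = {0x0: "NO SENSE", 0x6: "UNIT ATTENTION"}
ASC_ASCQ = {(0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED"}

STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+)\s+D:(0x[0-9a-fA-F]+)\s+P:(0x[0-9a-fA-F]+)\s+"
    r"(Valid|Possible)\s+sense\s+data:\s+"
    r"(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)"
)


def decode_status(text):
    """Return the decoded status fields from a log message, or None."""
    # \s+ in the pattern also matches the line breaks added by mail wrapping.
    m = STATUS_RE.search(text)
    if m is None:
        return None
    host, device, plugin = (int(m.group(i), 16) for i in (1, 2, 3))
    key, asc, ascq = (int(m.group(i), 16) for i in (5, 6, 7))
    valid = m.group(4) == "Valid"
    return {
        "host": HOST_STATUS.get(host, "unknown"),
        "device": DEVICE_STATUS.get(device, "unknown"),
        "plugin": plugin,
        "sense_valid": valid,
        # Sense bytes are only meaningful when the log marks them as valid.
        "sense_key": SENSE_KEY.get(key, "unknown") if valid else None,
        "asc_ascq": ASC_ASCQ.get((asc, ascq), "unknown") if valid else None,
    }


if __name__ == "__main__":
    line = ("Failed: H:0x0 D:0x2 P:0x0 Valid sense data: "
            "0x6 0x29 0x0.Act:NONE")
    print(decode_status(line))
```

Read this way, the failed Write(10) at 12:42:22 carries host status 0x8 with
no valid sense data, and the command at 12:42:23 returns CHECK CONDITION with
UNIT ATTENTION 0x29/0x00, which matches the "Power-on Reset occurred" message
logged by ScsiCore.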

This has occurred on several of the disks attached to the LSI 1068E controller
in the system. All disks are Samsung HD204UI. The guest O/S is Ubuntu 11.04
Server running in VMware ESXi 5.0, with all 6 drives attached to the guest via
Raw Device Mapping and assembled into a RAID5 array using mdadm.

When a hard disk undergoes a power-on reset (POR), it can fall out of an mdadm
array, causing permanent data loss. We have seen one occurrence of a
"Rebuild20" event from mdadm in the guest syslog. The POR causes all writes to
the array to stop for long enough to show up in the ESXi disk performance graph
(attached image).
