[Bug 11117] New: aic94xx doesn't sustain the load when more than 2 SAS drives are connected and actively used

http://bugzilla.kernel.org/show_bug.cgi?id=11117

           Summary: aic94xx doesn't sustain the load when more than 2 SAS
                    drives are connected and actively used
           Product: SCSI Drivers
           Version: 2.5
     KernelVersion: 2.6.23.1, 2.6.25
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: AIC94XX
        AssignedTo: scsi_drivers-aic94xx@xxxxxxxxxxxxxxxxxxxx
        ReportedBy: michael.gleibman@xxxxxxxxxxxxx


Latest working kernel version: NA
Earliest failing kernel version: 2.6.23.1
Distribution: Gentoo, customized; vanilla kernels
Hardware Environment: SuperMicro X7DB3 motherboard with integrated AIC-9410
controllers, 4 SAS drives connected
Software Environment: aic94xx SAS/SATA driver version 1.0.3
Problem Description:

When more than 2 SAS drives are connected and the system becomes loaded,
aic94xx logs a lot of messages similar to the following:

aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
sas: command 0xffff8101d39733c0, task 0xffff8105e9e51240, timed out: EH_NOT_HANDLED
sas: command 0xffff8104db3d1e40, task 0xffff8105ed10a6c0, timed out: EH_NOT_HANDLED
sas: command 0xffff81060f001380, task 0xffff81046fc0e3c0, timed out: EH_NOT_HANDLED
sas: command 0xffff8102710e6200, task 0xffff810587fb2980, timed out: EH_NOT_HANDLED
sas: command 0xffff8101df43f0c0, task 0xffff810235fcf9c0, timed out: EH_NOT_HANDLED
sas: command 0xffff810182974380, task 0xffff810020939b40, timed out: EH_NOT_HANDLED
sas: command 0xffff810591dd93c0, task 0xffff8105aab7ee00, timed out: EH_NOT_HANDLED
sas: command 0xffff8104fa24b540, task 0xffff8105e9e51840, timed out: EH_NOT_HANDLED
sas: command 0xffff8103a400d540, task 0xffff8102c7927e40, timed out: EH_NOT_HANDLED
sas: command 0xffff8105f8d29cc0, task 0xffff8102c3582980, timed out: EH_NOT_HANDLED
sas: command 0xffff8105c684c980, task 0xffff8105aebd5240, timed out: EH_NOT_HANDLED
sas: command 0xffff8104fa24b240, task 0xffff8103140edcc0, timed out: EH_NOT_HANDLED
sas: command 0xffff81058572cb00, task 0xffff8105aab7e980, timed out: EH_NOT_HANDLED
sas: command 0xffff8105f08d6500, task 0xffff8100266169c0, timed out: EH_NOT_HANDLED
sas: command 0xffff810182974500, task 0xffff8103140ed3c0, timed out: EH_NOT_HANDLED
sas: command 0xffff810302b9a080, task 0xffff8101ca94e800, timed out: EH_NOT_HANDLED
sas: command 0xffff81058572c080, task 0xffff8105aab7eb00, timed out: EH_NOT_HANDLED
sas: command 0xffff8105bc7df980, task 0xffff810026616540, timed out: EH_NOT_HANDLED
sas: Enter sas_scsi_recover_host
sas: trying to find task 0xffff8105e9e51240
sas: sas_scsi_find_task: aborting task 0xffff8105e9e51240
aic94xx: tmf tasklet complete
aic94xx: tmf resp tasklet
aic94xx: tmf came back
aic94xx: task not done, clearing nexus
aic94xx: asd_clear_nexus_tag: PRE
aic94xx: asd_clear_nexus_tag: POST
aic94xx: asd_clear_nexus_tag: clear nexus posted, waiting...
aic94xx: task 0xffff8105e9e51240 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: asd_clear_nexus_tasklet_complete: here
aic94xx: asd_clear_nexus_tasklet_complete: opcode: 0x0
aic94xx: task 0xffff8105ed10a6c0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff81046fc0e3c0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff810587fb2980 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff810235fcf9c0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff810020939b40 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8105aab7ee00 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8105e9e51840 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8102c7927e40 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8102c3582980 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8105aebd5240 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8103140edcc0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8105aab7e980 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8100266169c0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8103140ed3c0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8101ca94e800 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8105aab7eb00 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff810026616540 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: came back from clear nexus
aic94xx: task 0xffff8105e9e51240 aborted, res: 0x0
sas: sas_scsi_find_task: task 0xffff8105e9e51240 is done
sas: sas_eh_handle_sas_errors: task 0xffff8105e9e51240 is done
sas: trying to find task 0xffff8105ed10a6c0
sas: sas_scsi_find_task: aborting task 0xffff8105ed10a6c0
aic94xx: asd_abort_task: task 0xffff8105ed10a6c0 done
aic94xx: task 0xffff8105ed10a6c0 aborted, res: 0x0
sas: sas_scsi_find_task: task 0xffff8105ed10a6c0 is done
sas: sas_eh_handle_sas_errors: task 0xffff8105ed10a6c0 is done
...
...
EXT2-fs error (device sda9): read_block_bitmap: Cannot read block bitmap - block_group = 0, block_bitmap = 19
sd 2:0:0:0: rejecting I/O to offline device

This closely resembles the problem reported earlier by Patrick
LeBoutillier: http://lkml.org/lkml/2008/6/25/305

The system would eventually freeze during stress tests.
Here's what I've tried:

- Spreading the drives over 2 ports (2 drives on each port) - same thing.
- Leaving only 2 drives connected and unplugging the other 2 - the system
passes the stress tests cleanly.
- Adaptec sequencer firmware versions 1.1 (V32A4, V30) - same story.
- The latest Adaptec BIOS - same thing.
- Plugging in an external ASC48300 controller with the same AIC-9410 chip and
connecting the drives to it - the same problem as before.
- Plugging an external LSI MegaRAID 8208ELP controller into the same system
and connecting the same 4 drives to it - the system passes all the load tests
just fine.
- Reproducing the same behavior on another similar system: same motherboard,
different drives.
- The current workaround, as suggested by Patrick, is setting the PHY rate to
1.5 Gbps in the Adaptec controller BIOS. It seems to cure the problem for both
the onboard and PCI controllers at the cost of some performance degradation:
the system passes the stress tests without errors, but the drives are a bit
(~10%) slower. A sketch for verifying the negotiated rate follows this list.
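
For reference, the link rates the phys actually negotiated can be checked from
userspace through the sas_phy sysfs class exported by scsi_transport_sas,
which is handy for confirming that the BIOS workaround took effect after boot.
A minimal sketch, assuming the phys appear under /sys/class/sas_phy on this
system:

    #!/bin/sh
    # Print each SAS phy with its negotiated and maximum link rates.
    # With the 1.5 Gbps workaround active, negotiated_linkrate should
    # read something like "1.5 Gbit" rather than "3.0 Gbit".
    for phy in /sys/class/sas_phy/*; do
        [ -d "$phy" ] || continue
        printf '%s: negotiated=%s max=%s\n' "$(basename "$phy")" \
            "$(cat "$phy/negotiated_linkrate")" \
            "$(cat "$phy/maximum_linkrate")"
    done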

Steps to reproduce:

Plug 4 SAS drives into a system with an AIC-9410 controller and boot kernel
2.6.23.1 or 2.6.25 with the aic94xx driver configured; then apply some kind of
stress test to all the drives simultaneously. (I used a homebrew intensive
disk-load test utility, but earlier we ran dd if=/dev/sd* of=/dev/null
bs=16384 in parallel to create enough load to reproduce the errors; a sketch
of that follows.)
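
For anyone trying to reproduce this, the parallel dd load above expands to
roughly the following; a minimal sketch, where the device names sda through
sdd are assumptions to be replaced with the drives actually attached:

    #!/bin/sh
    # Read all four SAS drives in parallel to generate sustained I/O load.
    # Device names below are placeholders; substitute the real SAS drives.
    for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        dd if="$dev" of=/dev/null bs=16384 &
    done
    wait    # let the reads run; watch dmesg for the REQ_TASK_ABORT messages

As noted above, the same load with only 2 drives connected completes cleanly.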

