aic94xx + ST3146855SS still failing under heavy load

Leonid Kalmankin <lvk@xxxxxxxxxxxxx> · Mon, 14 Apr 2008 21:03:37 +0400

Hello!

We have a system with:

vanilla 2.6.25-rc8 (2.6.23 and 2.6.24 behave the same)

Adaptec AIC-9410W SAS (Razor ASIC RAID) (rev 09)
aic94xx: Found sequencer Firmware version 1.1 (V30)
  (Firmware version 1.1 (V17/10c6) makes no difference)
scsi 2:0:0:0: Direct-Access  SEAGATE ST3146855SS 0002 PQ: 0 ANSI: 5

It reliably fails under heavy IO:

> sas: command 0xffff81022c5f5640, task 0xffff8101f6b0f000, timed out: EH_NOT_HANDLED
> sas: command 0xffff81022c5f5500, task 0xffff8101f6b0f1c0, timed out: EH_NOT_HANDLED
> ....
> sas: Enter sas_scsi_recover_host
> sas: trying to find task 0xffff8101f6b0f000
> sas: sas_scsi_find_task: aborting task 0xffff8101f6b0f000
> aic94xx: task 0xffff8101f6b0f000 done with opcode 0x1e resp 0x0 stat 0x8d but aborted by upper layer!
> aic94xx: tmf tasklet complete
> aic94xx: tmf came back
> aic94xx: asd_abort_task: task 0xffff8101f6b0f000 done
> aic94xx: task 0xffff8101f6b0f000 aborted, res: 0x0
> sas: sas_scsi_find_task: task 0xffff8101f6b0f000 is done
> sas: sas_eh_handle_sas_errors: task 0xffff8101f6b0f000 is done
> sas: --- Exit sas_scsi_recover_host

Sometimes it recovers successfully; sometimes the disk is lost until reboot.
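
For anyone trying to reproduce this, here is a minimal sketch of a script that
pairs each timed-out task with the error handler's later "is done" message, so
that the tasks that were never recovered stand out in a long log. It is not part
of the kernel or of any driver tooling. It assumes the messages have exactly the
format quoted above; the function name summarize_eh and both regular expressions
are only my illustration.

```python
import re

# Patterns assume the exact message format shown in the quoted log.
TIMEOUT_RE = re.compile(
    r"sas: command (0x[0-9a-f]+), task (0x[0-9a-f]+), timed out: (\w+)"
)
DONE_RE = re.compile(
    r"sas: sas_eh_handle_sas_errors: task (0x[0-9a-f]+) is done"
)


def summarize_eh(lines):
    """Map each timed-out task address to True if the error handler
    later reported it as done, otherwise False."""
    tasks = {}
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            # group(2) is the task address; group(1) is the command address.
            tasks.setdefault(m.group(2), False)
            continue
        m = DONE_RE.search(line)
        if m and m.group(1) in tasks:
            tasks[m.group(1)] = True
    return tasks


# Sample lines copied from the log excerpt above.
SAMPLE = [
    "sas: command 0xffff81022c5f5640, task 0xffff8101f6b0f000, timed out: EH_NOT_HANDLED",
    "sas: command 0xffff81022c5f5500, task 0xffff8101f6b0f1c0, timed out: EH_NOT_HANDLED",
    "sas: Enter sas_scsi_recover_host",
    "sas: sas_scsi_find_task: aborting task 0xffff8101f6b0f000",
    "sas: sas_eh_handle_sas_errors: task 0xffff8101f6b0f000 is done",
    "sas: --- Exit sas_scsi_recover_host",
]

for task, done in summarize_eh(SAMPLE).items():
    print(task, "done" if done else "no completion seen")
```

To run it over a real log, pass the lines of a saved dmesg output to
summarize_eh() instead of SAMPLE.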

I have read http://archive.netbsd.se/?ml=linux-scsi&a=2008-01&t=6260524
and asked Seagate about a firmware update; they told me they do not have one.

As I understand it, the root of this problem is protocol errors in the disk's
firmware (other disks, for example the FUJITSU MBA3147RC, work fine); however,
errors of that kind should be recoverable by the sas/aic94xx drivers.

If that is true, I can test patches or ideas. Where should I start?
