On Wed, Oct 04, 2006 at 01:29:29PM -0500, Andy Warner wrote:
> Muli Ben-Yehuda wrote:
> > [resending as it probably hit the 100K limit the first time]
> >
> > I'm seeing these aic94xx IO errors on an IBM x366, usually after I
> > copy ~20GB but occasionally as soon as heavy IO starts. Happens with
> > and without Calgary enabled (iommu=off). I'm seeing this on two
> > different disks which badblocks claims are ok. The machine usually
> > stays up and keeps chugging along after this happens.
>
> Since you're working in this area,

Not really... I just need aic94xx working reliably so that when it
breaks, I can be reasonably certain it's because I broke Calgary :-)

> the processing for REQ_TASK_ABORT, REQ_DEVICE_RESET,
> SIGNAL_NCQ_ERROR and CLEAR_NCQ_ERROR needs fixing as all 4 events
> collapse to REQ_TASK_ABORT, because sb_opcode is masked with
> ~DL_PHY_MASK before the switch() in escb_tasklet_complete(). In
> unpatched code, check the phy number reported in the REQ_TASK_ABORT
> message:
>
> 0 => REQ_TASK_ABORT
> 1 => REQ_DEVICE_RESET
> 2 => SIGNAL_NCQ_ERROR
> 3 => CLEAR_NCQ_ERROR
>
> So you are seeing legitimate REQ_TASK_ABORT values, but need to look
> at the remaining data to see what the chip is trying to tell you.
> For REQ_TASK_ABORT, status_block[1..2] is the transaction context,
> and status_block[3] is the reason (TC_NO_ERROR etc from
> aic94xx_sas.h)

Using your patch (thanks!) I get

aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason 0x05

which corresponds to TI_BREAK.

Is an old firmware version ("Razor_10a1") expected to work with the
aic94xx in mainline? alternatively, is the new firmware version
("V17/10c6") expected to work with older aic94xx versions? if either
is true, I can try that tomorrow to see if firmware version makes a
difference with the bad aic94xx.

Cheers,
Muli
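
For readers following the thread, the opcode collapse Andy describes
can be sketched in a few lines of standalone C. This is not the driver
code and not the patch mentioned above. The numeric values of
DL_PHY_MASK and the four opcodes are assumptions chosen to match the
behaviour described (the low bits of the opcode showing up as "phy"
0..3); check aic94xx_sas.h for the real definitions. The helpers
masked_opcode(), reported_phy(), event_name() and show_collapse() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ASSUMED values, for illustration only; see aic94xx_sas.h for the real ones. */
#define DL_PHY_MASK       0x07u
#define REQ_TASK_ABORT    0xF0u
#define REQ_DEVICE_RESET  0xF1u
#define SIGNAL_NCQ_ERROR  0xF2u
#define CLEAR_NCQ_ERROR   0xF3u

/* The value the unpatched switch() effectively sees after masking. */
static uint8_t masked_opcode(uint8_t raw)
{
    return (uint8_t)(raw & ~DL_PHY_MASK);
}

/* The low bits that end up being reported as the "phy number". */
static uint8_t reported_phy(uint8_t raw)
{
    return (uint8_t)(raw & DL_PHY_MASK);
}

/* Hypothetical helper: name the event by testing the full opcode first. */
static const char *event_name(uint8_t raw)
{
    switch (raw) {
    case REQ_TASK_ABORT:   return "REQ_TASK_ABORT";
    case REQ_DEVICE_RESET: return "REQ_DEVICE_RESET";
    case SIGNAL_NCQ_ERROR: return "SIGNAL_NCQ_ERROR";
    case CLEAR_NCQ_ERROR:  return "CLEAR_NCQ_ERROR";
    default:               return "other (mask, then decode as before)";
    }
}

/* Print how each of the four opcodes looks before and after masking. */
static void show_collapse(void)
{
    static const uint8_t raw[] = {
        REQ_TASK_ABORT, REQ_DEVICE_RESET, SIGNAL_NCQ_ERROR, CLEAR_NCQ_ERROR
    };

    for (size_t i = 0; i < sizeof(raw) / sizeof(raw[0]); i++) {
        /* Under the assumed values, all four mask down to the same opcode. */
        assert(masked_opcode(raw[i]) == REQ_TASK_ABORT);
        printf("raw 0x%02x -> masked 0x%02x, phy %u, really %s\n",
               (unsigned)raw[i], (unsigned)masked_opcode(raw[i]),
               (unsigned)reported_phy(raw[i]), event_name(raw[i]));
    }
}
```

Calling show_collapse() from a small main() prints one line per
opcode. Under these assumed values every masked opcode is the same,
and only the "phy" field tells the four events apart, which is why the
phy number in an unpatched REQ_TASK_ABORT message is worth checking.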