Re: aic94xx driver woes continued

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Thu, 20 Mar 2008 14:57:07 -0500

On Thu, 2008-03-20 at 20:15 +0100, Raoul Bhatia [IPAX] wrote:
> James Bottomley wrote:
> > This is all normal.  Seagate drives are known for throwing protocol
> > errors under stress at certain revs of firmware.  That's what
> > REQ_TASK_ABORT, reason=0x6 is.
> > 
> > Your logs indicate that the recovery occurred correctly (as in all tasks
> > were eventually retried), so it doesn't show an actual problem.
> 
> ok, i already filed a trouble ticket at seagate - lets see if they
> provide a firmware update for the disks. afaik mine is "firmware 0002"
> 
> >> sometimes even a disk is kicked out of the raid configuration.
> > 
> > This would be abnormal, if you have a log of this, could you post it.  I
> > assume it was because of I/O errors?
> 
> i attached a bigger syslog file (.gz format).

OK, this looks more definitive, thanks!

What appears to be happening is that you get a run of protocol errors,
not necessarily all on the same command.  On every one of them, by
current design of the aic94xx driver, we halt the aic94xx, abort all
the outstanding commands and resubmit them.  Because the disk is being
hammered, there are rather a lot outstanding, so all it takes is five
protocol errors in a few seconds for one unlucky command to get aborted
five times (not necessarily through any fault of its own) and run out
of retries.  It is then returned to the upper layers with DID_ABORT and
treated as an I/O error.
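
To make the arithmetic concrete, here is a rough toy model of that
behaviour in Python.  It is not driver code: run_burst, its parameters
and the oldest-first completion order are all assumptions made for
illustration, and the default of five retries is simply taken from the
"five" above.

```python
from dataclasses import dataclass


@dataclass
class Command:
    tag: int
    aborts: int = 0


def run_burst(queue_depth, protocol_errors, completions_between_errors,
              max_retries=5):
    """Return the tags of commands that run out of retries (DID_ABORT).

    Hypothetical model: every protocol error aborts all outstanding
    commands, and a command aborted max_retries times is failed.
    """
    outstanding = [Command(tag) for tag in range(queue_depth)]
    next_tag = queue_depth
    failed = []
    for _ in range(protocol_errors):
        # One protocol error: every outstanding command is aborted,
        # whether or not it was the one that hit the error.
        survivors = []
        for cmd in outstanding:
            cmd.aborts += 1
            if cmd.aborts >= max_retries:
                failed.append(cmd.tag)
            else:
                survivors.append(cmd)
        # Assumption: resubmitted commands complete oldest-first, and only
        # this many finish before the next protocol error arrives.
        outstanding = survivors[completions_between_errors:]
        # The upper layers keep the queue full with fresh commands.
        while len(outstanding) < queue_depth:
            outstanding.append(Command(next_tag))
            next_tag += 1
    return failed


if __name__ == "__main__":
    print(len(run_burst(32, 5, 4)))                 # 16
    print(len(run_burst(8, 5, 4)))                  # 0
    print(len(run_burst(32, 5, 4, max_retries=8)))  # 0
```

In this model a deep queue leaves commands sitting through all five
errors, while a shallow queue or a larger retry budget lets every
command finish before its count runs out.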

A workaround might be to lower the queue depth to, say, 4 or 8 and raise
the retry count (the latter can only be done by changing the
SD_MAX_RETRIES define in include/scsi/sd.h and recompiling).
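
For the queue depth part, a minimal sketch of a helper that writes the
per-device queue_depth attribute in sysfs.  The function name
set_queue_depth, the sysfs_root parameter and the device name "sdb" in
the usage comment are hypothetical; it assumes the disk appears under
/sys/block/<device>/device/queue_depth and that you run it as root.

```python
from pathlib import Path


def set_queue_depth(device, depth, sysfs_root="/sys"):
    """Write a new queue depth for a SCSI disk and return the value read back."""
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    attr = Path(sysfs_root) / "block" / device / "device" / "queue_depth"
    attr.write_text(f"{depth}\n")
    # Read the attribute back, since the kernel may adjust the request.
    return int(attr.read_text().strip())


if __name__ == "__main__":
    import sys
    # Example (hypothetical device name): python3 set_qd.py sdb 4
    print(set_queue_depth(sys.argv[1], int(sys.argv[2])))
```

The setting does not survive a reboot, so it would need to be reapplied
for each disk behind the controller at boot time.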

Longer term, I think REQ_TASK_ABORT needs to be handled better on the
fly.  What we should do is abort only the task we've been asked to abort
and return it to the upper layer for a retry without invoking the error
handler ... I can look into this, but it will take a while.

James
