Re: aic94xx driver woes continued

Luben Tuikov <ltuikov@xxxxxxxxx> · Sat, 29 Mar 2008 15:39:18 -0700 (PDT)

--- On Thu, 3/20/08, James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> On Thu, 2008-03-20 at 20:15 +0100, Raoul Bhatia [IPAX] wrote:
> > James Bottomley wrote:
> > > This is all normal.  Seagate drives are known for throwing protocol
> > > errors under stress at certain revs of firmware.  That's what
> > > REQ_TASK_ABORT, reason=0x6 is.
> > >
> > > Your logs indicate that the recovery occurred correctly (as in all tasks
> > > were eventually retried), so it doesn't show an actual problem.
> > 
> > ok, i already filed a trouble ticket at seagate - let's see if they
> > provide a firmware update for the disks. afaik mine is "firmware 0002"
> > 
> > >> sometimes even a disk is kicked out of the raid configuration.
> > >
> > > This would be abnormal, if you have a log of this, could you post it.  I
> > > assume it was because of I/O errors?
> > 
> > i attached a bigger syslog file (.gz format).
> 
> OK, this looks more definitive, thanks!
> 
> What appears to be happening is that you get a run of protocol errors,
> not necessarily all on the same command, but what happens every time (by
> current design of the aic94xx driver) is that we halt the aic94xx, abort
> all the outstanding commands and resubmit them.  Because the disk is
> being hammered, there are rather a lot, so all it takes is five protocol
> errors in a few seconds for one unlucky command to get aborted five
> times (not necessarily through any fault of its own) and run out of
> retries.  This causes it to return to the upper layers with DID_ABORT
> and be treated as an I/O error.
> 
> A work around might be to lower the queue depth to say 4 or 8 and up the
> retries (this latter can only be done by altering the SD_MAX_RETRIES
> parameter in include/scsi/sd.h and recompiling).
> 
> Longer term, I think REQ_TASK_ABORT needs to be handled better on the
> fly.  What we should do is abort only the task we've been asked to abort
> and return it to the upper layer for a retry without invoking the error
> handler ... I can look into this, but it will take a while.
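
For illustration only, the retry arithmetic in the quoted explanation can be
sketched as a toy model. This is not aic94xx or SCSI midlayer code. The
function name, the completions_between_errors parameter, and the rule that a
command fails once its abort count reaches max_retries are all assumptions
made for the sketch; max_retries merely stands in for SD_MAX_RETRIES.

```python
def count_failed_commands(queue_depth, max_retries, errors,
                          completions_between_errors, abort_all):
    """Toy model: how many commands run out of retries after a burst of
    protocol errors. All names and rules here are illustrative assumptions."""
    next_id = 0
    queue = []   # outstanding commands, oldest first, as [id, abort_count]
    failed = 0
    for _ in range(errors):
        # Keep the device busy: top the queue up to the configured depth.
        while len(queue) < queue_depth:
            queue.append([next_id, 0])
            next_id += 1
        # One protocol error. Either every outstanding command is aborted
        # and resubmitted (abort_all), or only a single task is (here,
        # arbitrarily, the newest one).
        victims = queue if abort_all else queue[-1:]
        for cmd in victims:
            cmd[1] += 1
        # Commands that are out of retries go back up as I/O errors.
        failed += sum(1 for cmd in queue if cmd[1] >= max_retries)
        queue = [cmd for cmd in queue if cmd[1] < max_retries]
        # Before the next error, the oldest commands complete normally.
        queue = queue[completions_between_errors:]
    return failed


# Deep queue, abort everything on each error: 16 commands fail in this model.
print(count_failed_commands(32, 5, 5, 4, abort_all=True))
# Shallow queue: every command drains before the next error, so 0 fail.
print(count_failed_commands(4, 5, 5, 4, abort_all=True))
# Deep queue, but abort only the one task: 0 fail.
print(count_failed_commands(32, 5, 5, 4, abort_all=False))
```

In this model a deep queue keeps the same commands outstanding across all
five errors, so abort-everything charges each of them five retries. Lowering
the queue depth, raising the retry limit, or aborting only the named task
each bring the failure count to zero. Real timing and completion order will
differ.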

The original driver, from which you forked off, has always supported
this correct (SCSI) behaviour.

   Luben
