Re: [PATCH] SCSI: handle HARDWARE_ERROR sense correctly

Mike Anderson <andmike@xxxxxxxxxxxxxxxxxx> · Thu, 4 Dec 2008 15:39:42 -0800

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
> On Thu, 4 Dec 2008, James Bottomley wrote:
> 
> > On Thu, 2008-12-04 at 15:49 -0500, Alan Stern wrote:
> > > This patch (as1183) fixes a bug in scsi_check_sense().  The routine is
> > > documented as returning one of SUCCESS, FAILED, or NEEDS_RETRY.  But
> > > in the HARDWARE_ERROR case it can return ADD_TO_MLQUEUE.  And since it
> > > does this without bothering to increment the retry count, it can lead
> > > to an infinite retry loop.
> > > 
> > > The fix is to return NEEDS_RETRY instead.  Then the caller,
> > > scsi_decide_disposition(), will do the right thing.
> > 
> > OK, but why?
> > 
> > The current behaviour is to retry the error until the command timeout
> > expires, which, I think is what was needed by the annoying arrays that
> > have retryable hardware errors.
> > 

Yes there are some arrays that need this behavior. The two users: 
usb disks and the devinfo entries with BLIST_RETRY_HWERROR appear to have
two different expected behaviors.

> > What bug would this patch fix?  Because I can see it causing problems
> > with the arrays that originally reported this problem.
> 
> You're right and I was wrong.  It only _appeared_ to be an infinite
> retry loop because there were 6 iterations allowed and each iteration
> had a 60-second timeout.  So I retract the patch submission.
> 
> Waiting six minutes for a command to complete is a bit ridiculous,
> though -- especially when the user program generating that command
> decides to reissue it a few times.  Can we do anything to expedite the
> failure?

A short term hack until retry policy is aligned and updated would be to
use the sdev_bflags to differentiate the usb case from use of the
BLIST_RETRY_HWERROR bflag.

> For example, does it really make sense for scsi_softirq_done
> to multiply cmd->allowed by rq->timeout?  After all, if a command
> aborts with a timeout instead of failing outright, what point is there
> in retrying it?  The proper approach would have been to use a longer 
> timeout initially.
> 

The wait_for is used for more than retries of timeouts.

I had thought it might be a good idea to expose the wait_for value and
then users could control the wait_for behavior if needed by using a udev
rule to set it near the IO timeout value if so required.

> By the way, is there any reason to keep scmd->retries now?  If commands
> are going to be retried until they timeout, why bother to count the
> retries? 

Previously I had submitted some patches on scsi mid retry with a short text
on current retry policy (this cover mid retry policy vs
scsi_io_completion, which should be unified).

http://marc.info/?l=linux-scsi&m=122210133628085&w=2

I will try to refresh my patches with a updated policy document and also
align that with the changes to scsi_io_completion posted prior to
re-submit.

-andmike
--
Michael Anderson
andmike@xxxxxxxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html