Re: [PATCH] SCSI: handle HARDWARE_ERROR sense correctly

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Tue, 16 Dec 2008 14:56:59 -0500 (EST)

On Tue, 16 Dec 2008, James Bottomley wrote:

> It would be nice ... unfortunately, errors tend to migrate around the
> categories because of empirical results, so such a document could never
> be definitive.
> 
> As a rule of thumb:  errors where the device attempted an action that
> failed but is retryable count down the retries (things like parity/CRC
> errors on the bus).  Things which would produce no benefit from retrying
> (like Medium errors) get failed immediately, and things which indicate
> transient resource issues either in the device (QUEUE_FULL) or the host
> (DID_REQUEUE) get retried up to the timeout limit with a suitable backoff
> (mostly we refuse to issue more commands until one returns).

It would be wonderful if Mike or someone else would implement this
scheme.  The necessary changes shouldn't be very extensive.
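
For concreteness, here is a rough sketch of the three categories in
the rule of thumb above.  It is standalone user-space C, not kernel
code; the enum, struct, and function names are all made up for
illustration and don't correspond to anything in the SCSI midlayer.

```c
#include <stdbool.h>

/* Hypothetical error categories, following the rule of thumb. */
enum err_class {
	ERR_RETRY_COUNTED,		/* e.g. parity/CRC errors on the bus */
	ERR_FAIL_NOW,			/* e.g. medium errors */
	ERR_RETRY_UNTIL_TIMEOUT		/* e.g. QUEUE_FULL, DID_REQUEUE */
};

enum disposition { DISP_RETRY, DISP_FAIL };

/* Hypothetical per-command bookkeeping (time units are arbitrary). */
struct cmd_state {
	int retries;			/* counted retries used so far */
	int allowed;			/* maximum counted retries */
	unsigned long elapsed;		/* time since the command was first issued */
	unsigned long timeout;		/* overall time limit for the command */
};

static enum disposition decide(enum err_class cls, struct cmd_state *c)
{
	switch (cls) {
	case ERR_FAIL_NOW:
		/* Retrying would not help: fail immediately. */
		return DISP_FAIL;
	case ERR_RETRY_COUNTED:
		/* Count down the retries. */
		if (c->retries >= c->allowed)
			return DISP_FAIL;
		c->retries++;
		return DISP_RETRY;
	case ERR_RETRY_UNTIL_TIMEOUT:
		/* Transient resource issue: bounded by time, not by count. */
		return c->elapsed < c->timeout ? DISP_RETRY : DISP_FAIL;
	}
	return DISP_FAIL;
}
```

The backoff itself (not issuing more commands until one returns) is
left out; the sketch only covers the retry-or-fail decision.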

(And I still think the wait_for logic in scsi_softirq_done() is wrong; 
rq->timeout shouldn't be multiplied by cmd->allowed.)
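
To put numbers on that, assuming the window is computed as roughly
(allowed + 1) * timeout (my reading of the code; the helper names
below are hypothetical and the units arbitrary):

```c
/* Assumed current behaviour: the window scales with the retry count. */
static unsigned long wait_for_scaled(unsigned long timeout, unsigned int allowed)
{
	return (allowed + 1) * timeout;
}

/* What I would expect instead: the window is just the request timeout. */
static unsigned long wait_for_flat(unsigned long timeout, unsigned int allowed)
{
	(void)allowed;	/* retry count deliberately ignored */
	return timeout;
}
```

With a timeout of 30 and 5 allowed retries, the first form gives a
window of 180 and the second a window of 30.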

> > So the whole idea of the retry_hwerr flag is bogus; hardware errors
> > should always be retried.  Or perhaps only the name is bogus, since the
> > flag really indicates that the command should be tried over and over
> > again without pause until it succeeds or the request times out (whereas
> > normally hardware errors should be retried only a few times).
> 
> It doesn't actually; the MLQUEUE return blocks the device from further
> issue until a command returns (or, if the issue queue is empty, until
> I/O pressure causes a block unplug 3 times).

Okay, I misunderstood how that works.  Still, the code bypasses the 
normal retry pathways, leaving it vulnerable to these sorts of 
problems.  So why put the retry_hwerr test in check_sense()?  Why not 
put it in scsi_io_completion() instead, so that retries can be limited 
appropriately?
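
Roughly what I have in mind, as a standalone sketch rather than a
patch.  The structs, enum, and function names here are hypothetical
stand-ins, not the real check_sense() or scsi_io_completion()
interfaces:

```c
#include <stdbool.h>

struct dev { bool retry_hwerr; };
struct cmd { int retries; int allowed; };

enum action { ACT_REQUEUE, ACT_RETRY, ACT_FAIL };

/*
 * Early test (sense-checking stage): the flag leads straight to a
 * requeue, and no retry count is consulted.
 */
static enum action hwerr_early(const struct dev *d, struct cmd *c)
{
	(void)c;	/* retry bookkeeping is never looked at here */
	return d->retry_hwerr ? ACT_REQUEUE : ACT_FAIL;
}

/*
 * Late test (completion stage): the same flag, but each retry is
 * counted against the command's limit.
 */
static enum action hwerr_late(const struct dev *d, struct cmd *c)
{
	if (!d->retry_hwerr)
		return ACT_FAIL;
	if (c->retries >= c->allowed)
		return ACT_FAIL;
	c->retries++;
	return ACT_RETRY;
}
```

With allowed set to 2, hwerr_late() gives two retries and then a
failure, whereas hwerr_early() keeps requeueing for as long as the
flag is set.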

BTW, what happens if the issue queue is empty and there is no I/O
pressure?  Then the command wouldn't be retried at all, it would just
time out.  That doesn't seem like what you want.

Alan Stern
