On Tue, 2008-12-16 at 14:56 -0500, Alan Stern wrote:
> On Tue, 16 Dec 2008, James Bottomley wrote:
>
> > It would be nice ... unfortunately, errors tend to migrate around the
> > categories because of empirical results, so such a document could never
> > be definitive.
> >
> > As a rule of thumb: errors which produced an action in the device which
> > errors but is retryable count down the retries (things like parity/crc
> > errors on the bus).  Things which would produce no benefit retrying
> > (like Medium errors) get failed immediately and things which indicate
> > transient resource issues either in the device (QUEUE_FULL) or the host
> > (DID_REQUEUE) get retried up to the timeout limit with a suitable backoff
> > (mostly we refuse to issue more commands until one returns).
>
> It would be wonderful if Mike or someone else would implement this
> scheme.  The necessary changes shouldn't be very extensive.
>
> (And I still think the wait_for logic in scsi_softirq_done() is wrong;
> rq->timeout shouldn't be multiplied by cmd->allowed.)

It's the logical retry timeout ... if a command fails and times out it
gets retried, if it continues to time out, retries*timeout is the max
before it fails.

> > > So the whole idea of the retry_hwerr flag is bogus; hardware errors
> > > should always be retried.  Or perhaps only the name is bogus, since the
> > > flag really indicates that the command should be tried over and over
> > > again without pause until it succeeds or the request times out (whereas
> > > normally hardware errors should be retried only a few times).
> >
> > It doesn't actually; the MLQUEUE return blocks the device from further
> > issue until a command returns (or, if empty issue queue, until I/O
> > pressure causes a block unplug 3 times).
>
> Okay, I misunderstood how that works.  Still, the code bypasses the
> normal retry pathways, leaving it vulnerable to these sorts of
> problems.

It does?  How?
decide_disposition() goes straight into the expiry check.

> So why put the retry_hwerr test in check_sense()?  Why not
> put it in scsi_io_completion() instead, so that retries can be limited
> appropriately?

Because check_sense goes into decide_disposition, which gets the timeout
test applied on MLQUEUE return.

> BTW, what happens if the issue queue is empty and there is no I/O
> pressure?  Then the command wouldn't be retried at all, it would just
> time out.  That doesn't seem like what you want.

I/O pressure is proportional to the size of the request queue.  By
definition, a requeue means at least 1 outstanding request and thus some
pressure.

James
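
For readers following the thread, the rule of thumb quoted above and the
retries*timeout ceiling can be sketched as a small toy model in plain C.
This is only an illustration under stated assumptions, not the mid-layer
code: every name here (toy_cmd, err_class, toy_wait_for, toy_decide) is
hypothetical, and applying the expiry check before the error class is
looked at is an assumption made for this sketch.

```c
/* Toy model only: all names are hypothetical, not kernel identifiers. */

/* The three rough error categories from the rule of thumb. */
enum err_class {
    ERR_RETRYABLE,       /* e.g. parity/CRC errors on the bus */
    ERR_FATAL,           /* e.g. medium errors: retrying gains nothing */
    ERR_TRANSIENT_BUSY   /* e.g. QUEUE_FULL (device) or DID_REQUEUE (host) */
};

enum action { ACT_RETRY, ACT_REQUEUE, ACT_FAIL };

struct toy_cmd {
    int retries;            /* retries consumed so far */
    int allowed;            /* maximum number of retries */
    unsigned long start;    /* tick at which the command was first issued */
    unsigned long timeout;  /* per-attempt timeout, in ticks */
};

/* Logical retry timeout: a command that keeps timing out is given
 * at most retries * timeout before it is failed. */
unsigned long toy_wait_for(const struct toy_cmd *c)
{
    return (unsigned long)c->allowed * c->timeout;
}

enum action toy_decide(struct toy_cmd *c, enum err_class e, unsigned long now)
{
    /* Assumed ordering: the expiry check covers every error path, so
     * even a requeue cannot keep a command alive past the ceiling. */
    if (now - c->start > toy_wait_for(c))
        return ACT_FAIL;

    switch (e) {
    case ERR_FATAL:
        return ACT_FAIL;                 /* fail immediately */
    case ERR_RETRYABLE:
        if (++c->retries <= c->allowed)  /* count down the retries */
            return ACT_RETRY;
        return ACT_FAIL;
    case ERR_TRANSIENT_BUSY:
        return ACT_REQUEUE;              /* retried until the ceiling */
    }
    return ACT_FAIL;
}
```

In this model a requeue does not consume a retry, so only the
retries*timeout ceiling bounds it; the backoff itself (holding further
commands until one returns) is not modelled.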