Re: [PATCH] Fix handling of failed requests in scsi_io_completion

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Sun, 21 Sep 2008 12:30:30 -0400 (EDT)

On Sat, 20 Sep 2008, James Bottomley wrote:

> > Except that those places don't do the blk_noretry_request test.  Should 
> > they?  And even if they do, what's to prevent an infinite retry loop?
> 
> Well, that's another oddity ... the retries occur at a lower level
> (scsi_decide_disposition) ... there's no reason to make that check if
> all the error paths complete everything.

Are you referring to the NEEDS_RETRY case in scsi_softirq_done?  Yet
another looping mechanism!  This one requeues the scsi_cmnd, whereas I
was talking about requeuing the request.  However either one could in
theory lead to unending retries.

For example, suppose a buggy device (without removable media) always
replies with UNIT ATTENTION without making any forward progress.  
We'll just call scsi_requeue_command each time and get stuck.

>  Now if we find an error where
> we should be moving on to the next transaction, then we'd need to do
> that test.  However, I think I've demonstrated so far that there isn't
> one.  I also don't think we want to make the test for the obviously
> retryable conditions like UNIT_ATTENTION because that will cause paths
> to flip on irrelevant AEN conditions.

To be honest, I don't know what sort of requests get marked as
non-retryable in the block layer.  Maybe you're right and we don't need 
to worry about them.

But I'm still concerned about the possibility of getting stuck doing
the same command or request over and over.  Both structures have a
"retries" field, but I'm not clear on how/where they get used.

> > I take your point.  Which leads to the question: Why was the code ever 
> > calling scsi_end_request(cmd, -EIO, this_count, 1) in the first place?  
> > Apparently all of these paths should be setting the last argument to 0, 
> > always.
> 
> Yes, that was the question I asked in my first reply (where I said the
> requeue looks superfluous).  I think the answer is that there's no
> point ... that's why I advocated simply eliminating scsi_end_request()
> in favour of either scsi_requeue_request/blk_run_queue or

You mean just scsi_requeue_command?  There is no scsi_requeue_request.

> end_dequeued_request().

Yes.  Those new "goto"s in scsi_io_completion could be rearranged to 
end up calling either scsi_requeue_command or end_dequeued_request.  
That still leaves the initial call to scsi_end_request; I guess this is 
where you would move that routine's contents into scsi_io_completion.

> So are you happy with the simple fix proposal ... I think rearranging
> this code needs more debate and discussion.

I tested your simple fix, and it does indeed solve the problem of tasks
hanging because of an uncompleted request.  In view of Boaz's concerns,
should this change be postponed until 2.6.27.stable so that it can get
wider testing?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html