Re: [PATCH] Fix handling of failed requests in scsi_io_completion

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Sun, 21 Sep 2008 14:59:37 -0500

On Sun, 2008-09-21 at 12:30 -0400, Alan Stern wrote:
> On Sat, 20 Sep 2008, James Bottomley wrote:
> 
> > > Except that those places don't do the blk_noretry_request test.  Should 
> > > they?  And even if they do, what's to prevent an infinite retry loop?
> > 
> > Well, that's another oddity ... the retries occur at a lower level
> > (scsi_decide_disposition) ... there's no reason to make that check if
> > all the error paths complete everything.
> 
> Are you referring to the NEEDS_RETRY case in scsi_softirq_done?  Yet
> another looping mechanism!  This one requeues the scsi_cmnd, whereas I
> was talking about requeuing the request.  However either one could in
> theory lead to unending retries.
> 
> For example, suppose a buggy device (without removable media) always
> replies with UNIT ATTENTION without making any forward progress.  
> We'll just call scsi_requeue_command each time and get stuck.

That's a separate bug from the current one ... fortunately one I don't
think we've actually seen manifest.

> >  Now if we find an error where
> > we should be moving on to the next transaction, then we'd need to do
> > that test.  However, I think I've demonstrated so far that there isn't
> > one.  I also don't think we want to make the test for the obviously
> > retryable conditions like UNIT_ATTENTION because that will cause paths
> > to flip on irrelevant AEN conditions.
> 
> To be honest, I don't know what sort of requests get marked as
> non-retryable in the block layer.  Maybe you're right and we don't need 
> to worry about them.

They tend to be device mapper ones.  Anything that wants a fast failure
to do path switch over for instance.

> But I'm still concerned about the possibility of getting stuck doing
> the same command or request over and over.  Both structures have a
> "retries" field, but I'm not clear on how/where they get used.

Block relies on the lower layers for retry ... it just transmits the
status, so we get to fix it.

> > > I take your point.  Which leads to the question: Why was the code ever 
> > > calling scsi_end_request(cmd, -EIO, this_count, 1) in the first place?  
> > > Apparently all of these paths should be setting the last argument to 0, 
> > > always.
> > 
> > Yes, that was the question I asked in my first reply (where I said the
> > requeue looks superfluous).  I think the answer is that there's no
> > point ... that's why I advocated simply eliminating scsi_end_request()
> > in favour of either scsi_requeue_request/blk_run_queue or
> 
> You mean just scsi_requeue_command?  There is no scsi_requeue_request.

Yes.

> > end_dequeued_request().
> 
> Yes.  Those new "goto"s in scsi_io_completion could be rearranged to 
> end up calling either scsi_requeue_command or end_dequeued_request.  
> That still leaves the initial call to scsi_end_request; I guess this is 
> where you would move that routine's contents into scsi_io_completion.
> 
> > So are you happy with the simple fix proposal ... I think rearranging
> > this code needs more debate and discussion.
> 
> I tested your simple fix, and it does indeed solve the problem of tasks
> hanging because of an uncompleted request.  In view of Boaz's concerns,
> should this change be postponed until 2.6.27.stable so that it can get
> wider testing?

We can ... I think it's safe enough though given it only affects
multiple transaction commands.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html