On Sun, 2008-09-21 at 12:30 -0400, Alan Stern wrote: > On Sat, 20 Sep 2008, James Bottomley wrote: > > > > Except that those places don't do the blk_noretry_request test. Should > > > they? And even if they do, what's to prevent an infinite retry loop? > > > > Well, that's another oddity ... the retries occur at a lower level > > (scsi_decide_disposition) ... there's no reason to make that check if > > all the error paths complete everything. > > Are you referring to the NEEDS_RETRY case in scsi_softirq_done? Yet > another looping mechanism! This one requeues the scsi_cmnd, whereas I > was talking about requeuing the request. However either one could in > theory lead to unending retries. > > For example, suppose a buggy device (without removable media) always > replies with UNIT ATTENTION without making any forward progress. > We'll just call scsi_requeue_command each time and get stuck. That's a separate bug from the current one ... fortunately one I don't think we've actually seen manifest. > > Now if we find an error where > > we should be moving on to the next transaction, then we'd need to do > > that test. However, I think I've demonstrated so far that there isn't > > one. I also don't think we want to make the test for the obviously > > retryable conditions like UNIT_ATTENTION because that will cause paths > > to flip on irrelevant AEN conditions. > > To be honest, I don't know what sort of requests get marked as > non-retryable in the block layer. Maybe you're right and we don't need > to worry about them. They tend to be device mapper ones. Anything that wants a fast failure to do path switch over for instance. > But I'm still concerned about the possibility of getting stuck doing > the same command or request over and over. Both structures have a > "retries" field, but I'm not clear on how/where they get used. Block relies on the lower layers for retry ... it just transmits the status, so we get to fix it. > > > I take your point. Which leads to the question: Why was the code ever > > > calling scsi_end_request(cmd, -EIO, this_count, 1) in the first place? > > > Apparently all of these paths should be setting the last argument to 0, > > > always. > > > > Yes, that was the question I asked in my first reply (where I said the > > requeue looks superfluous). I think the answer is that there's no > > point ... that's why I advocated simply eliminating scsi_end_request() > > in favour of either scsi_requeue_request/blk_run_queue or > > You mean just scsi_requeue_command? There is no scsi_requeue_request. Yes. > > end_dequeued_request(). > > Yes. Those new "goto"s in scsi_io_completion could be rearranged to > end up calling either scsi_requeue_command or end_dequeued_request. > That still leaves the initial call to scsi_end_request; I guess this is > where you would move that routine's contents into scsi_io_completion. > > > So are you happy with the simple fix proposal ... I think rearranging > > this code needs more debate and discussion. > > I tested your simple fix, and it does indeed solve the problem of tasks > hanging because of an uncompleted request. In view of Boaz's concerns, > should this change be postponed until 2.6.27.stable so that it can get > wider testing? We can ... I think it's safe enough though given it only affects multiple transaction commands. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html