On Thu, 2014-03-20 at 11:36 -0400, Alan Stern wrote: > On Wed, 19 Mar 2014, James Bottomley wrote: > > > > Basically, usb-storage deadlocks when the SCSI error handler invokes > > > the eh_device_reset_handler callback while a command is running. The > > > command has timed out and will never complete normally, because the > > > device's firmware has crashed. But usb-storage's device-reset routine > > > waits for the current command to finish, which brings everything to a > > > standstill. > > > > > > Is this design wrong? That is, should the device-reset routine wait > > > for currently executing commands to finish, or should it abort them, or > > > what? > > > > In some sense, yes, but not necessarily from the Point of View of USB. > > What we assume in SCSI is that commands are forgettable, meaning there's > > always some operation we can perform (be it abort or reset) that causes > > the device to forget about all outstanding commands and reset its > > internal state machine to a known good state. > > > > The cardinal SCSI assumption is that after we've successfully done an > > abort or reset on a command that it will never come back to us from the > > device. > > The same assumption is present in usb-storage. > > The difference appears to be that SCSI assumes a reset can be issued at > any time, even while a command is running. In usb-storage, that's true > for a _bus_ reset but it's not true for a _device_ reset. > > Perhaps this restriction on device resets could be lifted, but in real > life it probably won't make much difference. The fact is, almost no > USB mass-storage devices implement device reset correctly, whereas most > of them do implement bus reset. (Possibly related factoid: MS Windows > uses bus resets but doesn't use device resets in its USB mass-storage > driver.) Actually, you don't have to lift the restriction. Just fail the device reset if you can't issue it (so if there's any outstanding commands). SCSI will move on to a bus reset which you can accept. > > > Or should the SCSI error handler abort the running command before > > > invoking the eh_device_reset_handler callback? > > > > So this is rooted in the "Abort can be a Problem" issue: Abort > > sometimes works well (and it's not very disruptive) but sometimes if the > > device is having a problem in its command state machine, adding another > > command (which is what the abort is) doesn't actually do anything, so we > > need error escalation to reset. We can't wait for the abort or other > > commands to complete because they never will. The reset is expected to > > clear everything from the device (including the pending aborts). > > With usb-storage, aborts usually work pretty well. But sometimes they > don't cure the underlying cause, so when the offending command is > retried, it hangs up again. Something like that seems to be happening > here. > > > > For the record, and in case anyone is curious, here's the detailed > > > sequence of events during my test: > > > > > > sd issues a READ(10) command. For whatever reason, the device > > > goes nuts and the command times out. > > > > > > scsi_times_out() calls scsi_abort_command(), which queues an > > > abort request. > > > > > > scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which > > > succeeds in aborting the READ. > > > > > > The READ command is retried (I didn't trace through the details > > > of this). The retry fails with a Unit Attention (SK=6, > > > ASC=0x29, Reset or Bus Device Reset Occurred). > > > > > > The READ command is retried a second time, and it times out > > > again. > > > > > > This time around, scsi_times_out() calls scsi_abort_command() > > > unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is > > > still set). > > > > From the first time we sent the abort? > > Yes. As far as I can see, the only place where SCSI_EH_ABORT_SCHEDULED > gets cleared is in scsi_eh_finish_cmd(), which never got called. > After the successful abort, scmd_eh_abort_handler() went directly into > its "if (rtn == SUCCESS)" and "scmd %p retry aborted command" / > scsi_queue_insert() code paths. OK, that's a bug ... I'll see if I can find it. > > That sounds like a problem in > > our state tracking. > > Could well be. How should I track it down further? Or can you suggest > a patch just from this much information? > > > > As a result, scsi_error_handler() calls scsi_unjam_host(), > > > which calls scsi_eh_get_sense(). > > > > > > That routine calls scsi_request_sense(), which goes into > > > scsi_send_eh_cmnd(). > > > > I thought USB was autosense, so when it reports check condition, we > > should already have sense ... or are we calling request_sense without > > being sent a check condition status? > > usb-storage does autosense. As far as I can tell, the REQUEST SENSE is > issued because there is no sense data, which is because the READ > command is still running. Is this another state-tracking bug? Yes ... we should only issue a request sense if we got a check condition return. If we're doing it without that, something is wrong somewhere. > > > The calls to shost->hostt->queuecommand() all fail, because the > > > READ command is still running and usb-storage has a queue > > > depth of 1. The error messages produced by these failures are > > > disconcerting but not dangerous. > > > > > > Since the REQUEST SENSE command was never issued, > > > scsi_eh_get_sense() returns 0. > > > > > > scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which > > > does essentially nothing because the SCSI_EH_CANCEL_CMD flag > > > for the only command on work_q is clear. > > > scsi_eh_test_devices() returns 0 because check_list is empty > > > and work_q isn't. > > > > > > scsi_unjam_host() then calls scsi_eh_ready_devs(). This > > > routine ends up calling scsi_eh_bus_device_reset(), at which > > > point usb-storage deadlocks as described above. > > > > OK, so in the case where the command can never complete (because the fw > > has crashed), what should be the process for resetting the device so it > > can again function? > > Simply abort the command. usb-storage will automatically carry out a > bus reset under those conditions, and that's the best we can do. > > As you can see, the sequence above started out doing exactly this. > The problems began when the command was retried. That time it didn't > get aborted, thanks to the leftover SCSI_EH_ABORT_SCHEDULED flag. > Things got worse from there. OK, so I think we have three things to do 1. Investigate SCSI and fix it's abort state problem that's causing it not to send the abort second time around 2. Fix usb-storage to fail a reset it can't do (i.e. device reset with outstanding commands) 3. Find out why we're sending a spurious request sense. I can look at 1 and 3 if you want to take 2. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html