On Wed, 19 Mar 2014, Andreas Reis wrote: > I've uploaded a dmesg with the new debugging patch to bugzilla: > https://bugzilla.kernel.org/attachment.cgi?id=130041 Thanks. I have now managed to reproduce many of the features of this problem on my own computer. James, I will need your help (or help from somebody who understands the SCSI error handler) to figure out how this problem should be fixed. Basically, usb-storage deadlocks when the SCSI error handler invokes the eh_device_reset_handler callback while a command is running. The command has timed out and will never complete normally, because the device's firmware has crashed. But usb-storage's device-reset routine waits for the current command to finish, which brings everything to a standstill. Is this design wrong? That is, should the device-reset routine wait for currently executing commands to finish, or should it abort them, or what? Or should the SCSI error handler abort the running command before invoking the eh_device_reset_handler callback? For the record, and in case anyone is curious, here's the detailed sequence of events during my test: sd issues a READ(10) command. For whatever reason, the device goes nuts and the command times out. scsi_times_out() calls scsi_abort_command(), which queues an abort request. scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which succeeds in aborting the READ. The READ command is retried (I didn't trace through the details of this). The retry fails with a Unit Attention (SK=6, ASC=0x29, Reset or Bus Device Reset Occurred). The READ command is retried a second time, and it times out again. This time around, scsi_times_out() calls scsi_abort_command() unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is still set). As a result, scsi_error_handler() calls scsi_unjam_host(), which calls scsi_eh_get_sense(). That routine calls scsi_request_sense(), which goes into scsi_send_eh_cmnd(). The calls to shost->hostt->queuecommand() all fail, because the READ command is still running and usb-storage has a queue depth of 1. The error messages produced by these failures are disconcerting but not dangerous. Since the REQUEST SENSE command was never issued, scsi_eh_get_sense() returns 0. scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which does essentially nothing because the SCSI_EH_CANCEL_CMD flag for the only command on work_q is clear. scsi_eh_test_devices() returns 0 because check_list is empty and work_q isn't. scsi_unjam_host() then calls scsi_eh_ready_devs(). This routine ends up calling scsi_eh_bus_device_reset(), at which point usb-storage deadlocks as described above. (On Andreas's system, the first READ retry times out as opposed to the second retry as on my computer. I doubt this makes any difference.) I can't tell if this is all working as intended or if it went off the tracks somewhere. Thanks for any guidance. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html