Deadlock in usb-storage error handling

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Wed, 19 Mar 2014 16:31:42 -0400 (EDT)

On Wed, 19 Mar 2014, Andreas Reis wrote:

> I've uploaded a dmesg with the new debugging patch to bugzilla:
> https://bugzilla.kernel.org/attachment.cgi?id=130041

Thanks.  I have now managed to reproduce many of the features of this
problem on my own computer.

James, I will need your help (or help from somebody who understands the 
SCSI error handler) to figure out how this problem should be fixed.

Basically, usb-storage deadlocks when the SCSI error handler invokes
the eh_device_reset_handler callback while a command is running.  The
command has timed out and will never complete normally, because the
device's firmware has crashed.  But usb-storage's device-reset routine
waits for the current command to finish, which brings everything to a
standstill.
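
As a rough illustration of the standstill described above, here is a toy
user-space model in C.  It is only a sketch: every name in it
(model_dev, model_reset_waiting, and so on) is hypothetical and none of
it is taken from usb-storage.  The max_polls bound exists only so the
model terminates; the scenario described above has no such bound, which
is exactly why it hangs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one device; not the driver's real state. */
struct model_dev {
    bool cmd_in_flight;   /* a command currently owns the device */
    bool firmware_alive;  /* crashed firmware never completes a command */
};

/* One polling step: the command finishes only if the firmware responds. */
void model_poll_device(struct model_dev *d)
{
    if (d->cmd_in_flight && d->firmware_alive)
        d->cmd_in_flight = false;
}

/*
 * Device reset that waits for the running command to finish.
 * Returns true if the reset was performed, false if the wait would
 * never have ended (the deadlock case).
 */
bool model_reset_waiting(struct model_dev *d, int max_polls)
{
    assert(max_polls >= 0);
    for (int i = 0; i < max_polls && d->cmd_in_flight; i++)
        model_poll_device(d);
    if (d->cmd_in_flight)
        return false;          /* still waiting: nothing can make progress */
    d->firmware_alive = true;  /* assumption: a reset revives the device */
    return true;
}

/* Alternative design: abort the running command first, then reset. */
bool model_reset_aborting(struct model_dev *d)
{
    d->cmd_in_flight = false;  /* give up on the stuck command */
    d->firmware_alive = true;  /* assumption: a reset revives the device */
    return true;
}
```

With cmd_in_flight set and firmware_alive clear, model_reset_waiting()
never gets to reset the device no matter how large max_polls is, while
model_reset_aborting() does.  Which of the two behaviours is the right
one for the real driver is the open question.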

Is this design wrong?  That is, should the device-reset routine wait 
for currently executing commands to finish, or should it abort them, or 
what?

Or should the SCSI error handler abort the running command before 
invoking the eh_device_reset_handler callback?

For the record, and in case anyone is curious, here's the detailed
sequence of events during my test:

	sd issues a READ(10) command.  For whatever reason, the device
	goes nuts and the command times out.

	scsi_times_out() calls scsi_abort_command(), which queues an
	abort request.

	scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which
	succeeds in aborting the READ.

	The READ command is retried (I didn't trace through the details
	of this).  The retry fails with a Unit Attention (SK=6,
	ASC=0x29, Power On, Reset, or Bus Device Reset Occurred).

	The READ command is retried a second time, and it times out 
	again.

	This time around, scsi_times_out() calls scsi_abort_command()
	unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is
	still set).

	As a result, scsi_error_handler() calls scsi_unjam_host(), 
	which calls scsi_eh_get_sense().

	That routine calls scsi_request_sense(), which goes into
	scsi_send_eh_cmnd().

	The calls to shost->hostt->queuecommand() all fail, because the
	READ command is still running and usb-storage has a queue
	depth of 1.  The error messages produced by these failures are
	disconcerting but not dangerous.

	Since the REQUEST SENSE command was never issued, 
	scsi_eh_get_sense() returns 0.

	scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which
	does essentially nothing because the SCSI_EH_CANCEL_CMD flag
	for the only command on work_q is clear.  
	scsi_eh_test_devices() returns 0 because check_list is empty
	and work_q isn't.

	scsi_unjam_host() then calls scsi_eh_ready_devs().  This
	routine ends up calling scsi_eh_bus_device_reset(), at which 
	point usb-storage deadlocks as described above.
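
The escalation traced above can be condensed into a small C model.  This
is a simplified sketch based only on the sequence in this message, not a
copy of the SCSI midlayer: the model_* names are hypothetical, the two
booleans merely stand in for the SCSI_EH_ABORT_SCHEDULED and
SCSI_EH_CANCEL_CMD flags, and the assumption that each stage fully
resolves the command whenever it is able to act is mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which stage of the modelled error handler dealt with the command. */
enum model_stage {
    STAGE_SENSE_RESOLVED,   /* REQUEST SENSE was issued and settled it */
    STAGE_ABORT_RESOLVED,   /* the abort pass settled it */
    STAGE_DEVICE_RESET      /* fell through to the device reset */
};

struct model_cmd {
    bool abort_scheduled;   /* stands in for SCSI_EH_ABORT_SCHEDULED */
    bool cancel_cmd;        /* stands in for SCSI_EH_CANCEL_CMD */
};

/*
 * Timeout path: returns true if an abort request could be queued,
 * false if the command must be handed to the error handler instead.
 */
bool model_times_out(struct model_cmd *cmd)
{
    assert(cmd != NULL);
    if (cmd->abort_scheduled)
        return false;       /* an abort was already scheduled earlier */
    cmd->abort_scheduled = true;
    return true;
}

/* With a queue depth of 1, no sense command fits while the READ runs. */
bool model_get_sense(bool host_busy)
{
    return !host_busy;
}

/* The abort pass only acts on commands marked for cancellation. */
bool model_abort_cmds(const struct model_cmd *cmd)
{
    return cmd->cancel_cmd;
}

/* Try each recovery stage in order and report where the command ended up. */
enum model_stage model_unjam_host(const struct model_cmd *cmd, bool host_busy)
{
    assert(cmd != NULL);
    if (model_get_sense(host_busy))
        return STAGE_SENSE_RESOLVED;
    if (model_abort_cmds(cmd))
        return STAGE_ABORT_RESOLVED;
    return STAGE_DEVICE_RESET;
}
```

Calling model_times_out() twice on the same command succeeds the first
time and fails the second; model_unjam_host() with host_busy set and
cancel_cmd clear then lands on STAGE_DEVICE_RESET, matching the path
that ends in the deadlock.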

(On Andreas's system, the first READ retry times out as opposed to the
second retry as on my computer.  I doubt this makes any difference.)

I can't tell if this is all working as intended or if it went off the 
tracks somewhere.

Thanks for any guidance.

Alan Stern
