Re: Deadlock in usb-storage error handling

Dan Williams <dan.j.williams@xxxxxxxxx> · Wed, 19 Mar 2014 13:54:18 -0700



On Wed, Mar 19, 2014 at 1:31 PM, Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, 19 Mar 2014, Andreas Reis wrote:
>
>> I've uploaded a dmesg with the new debugging patch to bugzilla:
>> https://bugzilla.kernel.org/attachment.cgi?id=130041
>
> Thanks.  I have now managed to reproduce many of the features of this
> problem on my own computer.
>
> James, I will need your help (or help from somebody who understands the
> SCSI error handler) to figure out how this problem should be fixed.
>
> Basically, usb-storage deadlocks when the SCSI error handler invokes
> the eh_device_reset_handler callback while a command is running.  The
> command has timed out and will never complete normally, because the
> device's firmware has crashed.  But usb-storage's device-reset routine
> waits for the current command to finish, which brings everything to a
> standstill.
>
> Is this design wrong?  That is, should the device-reset routine wait
> for currently executing commands to finish, or should it abort them, or
> what?
>
> Or should the SCSI error handler abort the running command before
> invoking the eh_device_reset_handler callback?
>
> For the record, and in case anyone is curious, here's the detailed
> sequence of events during my test:
>
>         sd issues a READ(10) command.  For whatever reason, the device
>         goes nuts and the command times out.
>
>         scsi_times_out() calls scsi_abort_command(), which queues an
>         abort request.
>
>         scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which
>         succeeds in aborting the READ.
>
>         The READ command is retried (I didn't trace through the details
>         of this).  The retry fails with a Unit Attention (SK=6,
>         ASC=0x29, Reset or Bus Device Reset Occurred).
>
>         The READ command is retried a second time, and it times out
>         again.
>
>         This time around, scsi_times_out() calls scsi_abort_command()
>         unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is
>         still set).
>
>         As a result, scsi_error_handler() calls scsi_unjam_host(),
>         which calls scsi_eh_get_sense().
>
>         That routine calls scsi_request_sense(), which goes into
>         scsi_send_eh_cmnd().
>
>         The calls to shost->hostt->queuecommand() all fail, because the
>         READ command is still running and usb-storage has a queue
>         depth of 1.  The error messages produced by these failures are
>         disconcerting but not dangerous.
>
>         Since the REQUEST SENSE command was never issued,
>         scsi_eh_get_sense() returns 0.
>
>         scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which
>         does essentially nothing because the SCSI_EH_CANCEL_CMD flag
>         for the only command on work_q is clear.
>         scsi_eh_test_devices() returns 0 because check_list is empty
>         and work_q isn't.
>
>         scsi_unjam_host() then calls scsi_eh_ready_devs().  This
>         routine ends up calling scsi_eh_bus_device_reset(), at which
>         point usb-storage deadlocks as described above.
>
> (On Andreas's system, the first READ retry times out as opposed to the
> second retry as on my computer.  I doubt this makes any difference.)
>
> I can't tell if this is all working as intended or if it went off the
> tracks somewhere.
>
> Thanks for any guidance.
>

It seems to me we need an ->eh_strategy_handler() that understands the
usb transport and will escalate to device reset around the time
scsi_abort_command() fails.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html