Re: Reproducible deadlock when usb-storage scsi command timeouts twice

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Sun, 30 Apr 2023 17:10:19 -0400

On Sun, Apr 30, 2023 at 12:39:12PM -0700, Bart Van Assche wrote:
> On 4/26/23 12:20, Alan Stern wrote:
> > > This patch fixes the issue, not sure if it's correct:
> > > 
> > > --- a/drivers/usb/storage/scsiglue.c
> > > +++ b/drivers/usb/storage/scsiglue.c
> > > @@ -455,6 +455,9 @@ static int device_reset(struct scsi_cmnd *srb)
> > >          usb_stor_dbg(us, "%s called\n", __func__);
> > > +       if (us->srb == srb)
> > > +               command_abort(srb);
> > > +
> > >          /* lock the device pointers and do the reset */
> > >          mutex_lock(&(us->dev_mutex));
> > >          result = us->transport_reset(us);
> > 
> > Maybe...  But it would be better to check first whether the SCSI core is
> > supposed to be reusing an active srb in this way.
> > 
> > Martin, can tell us what is supposed to happen here?  Is the
> > eh_device_reset_handler routine supposed to be called with a scsi_cmnd
> > for a currently active command?
> 
> Hi Alan,
> 
> I'm not aware of any other .eh_device_reset_handler implementation that
> aborts the command that is passed to this callback before it aborts other
> SCSI commands.

That's not a good description of what's happening.  For one thing, 
usb-storage can process only one command at a time, so the "before it 
aborts other SCSI commands" part doesn't apply.  In fact, the proposed 
patch aborts the current command (which happens to be the same as the 
one passed to the .eh_device_reset_handler routine) before beginning a 
reset.

>  However, I'm not aware of an equivalent of us_data.dev_mutex
> in other SCSI LLDs either. Maybe this deadlock is specific to the USB
> storage handler?

The deadlock, if that's what it is, is related to the fact that 
usb-storage does only one thing at a time.  The mutex is there merely to 
enforce this; removing it wouldn't help matters since the driver still 
wouldn't be capable of doing more than one thing at a time.

The underlying unstated issue is that usb-storage doesn't have any 
timeouts of its own.  If a device fails to complete a command properly 
and the SCSI error handler doesn't abort the command, then the command 
will never finish and the driver will hang.  (Until the device is 
hot-unplugged, of course.)

I suppose a bus reset would probably cause an ongoing command to fail.  
A device reset almost certainly would not, since most device 
implementations of the USB bulk-only mass-storage protocol do a very bad 
job of handling device resets.

Either way, it's a lot cleaner simply to abort the command before 
starting the reset.  If usb-storage can't depend on the SCSI error 
handler to do this for us, it will have to do so itself.

Alan Stern

> Martin, feel free to correct me if I got anything wrong.
> 
> Bart.