Re: [RFC 8/9] UAS: Cancel pending URBs and free streams in pre-reset.

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Mon, 5 Dec 2011 17:14:57 -0500 (EST)

On Mon, 5 Dec 2011, Sarah Sharp wrote:

> > > It's true that another interface driver could want to reset the USB
> > > device, not just the UAS driver.  But is there a way to tell the
> > > difference without adding code to the UAS driver?
> > 
> > There _is_ no real difference.  For example, it's entirely possible
> > that another interface's driver could decide to reset the device at the
> > same time as the UAS driver.
> > 
> > On the other hand, it's not a big deal to add a single flag meaning "I 
> > have decided to reset the device".
> 
> Ok, sure.  So your suggestion would be to check that flag, and if it's
> not set, attempt to wait for the commands to flush themselves out before
> canceling and killing them?

I don't know.  That's a design decision and it helps to have experience
with real-life UAS disks and their problems.  You have much more
experience with them than I do (namely, zero).

> > In that case the other driver would have to wait for the timeout.  You
> > could improve the situation a little when this occurs by giving the
> > existing commands only a short time in which to complete, after which
> > you try to abort them.  The abort would also need to have a short
> > timeout.
> 
> Ok, so for each anchor, I would call usb_wait_anchor_empty_timeout()
> with a short timeout, then usb_kill_anchored_urbs()?
> 
> BTW, how short is short?  There could be up to 256 commands queued and
> we don't want the user to wait for, say, more than a second, so I don't
> think the timeout should be bigger than about 3ms?

You could use a single timeout of 500 ms (or something like that) for
all the queued commands in parallel.  Anything that hasn't finished by 
then gets aborted.

> > I guess this depends on your philosophy: When a reset is coming up, do
> > you want to allow as many outstanding commands as possible to complete
> > normally?  Or do you prefer to abort them all immediately?
> 
> Well, by the time we get to a reset (one issued by the UAS driver, not
> another driver), the device has already refused to take a particular
> command out of its queue by either not answering or returning a failing
> status for the ABORT TASK IU.  So at that point we know the device has
> something wrong with its SCSI queue handling.  Do we trust it with other
> commands?  I'm not sure, but I lean towards aborting all outstanding
> commands rather than letting them finish.

All right, that's your choice.

> > If you do decide to cancel them, then yes, uas_stat_cmplt requires some
> > sort of synchronization.  But it has to be synchronized with the
> > cancellation operation, not with the reset.
> 
> uas_stat_cmplt() needs to be synchronized with the reset operation
> because it can cause the work queue to be scheduled.  uas_prereset needs
> to ensure that work queue is completely stopped before it attempts to
> manipulate the workqueue and kill any partially queued URBs.
> 
> > For example, what would happen if the driver were asked to cancel a
> > command just as uas_stat_cmplt was called?
> 
> Ok, you're correct that there is a problem there, but I think it's a
> separate issue from whether we need SRCU for the work queue sync.

Yes, it is separate.  At this point I was considering what sort of 
synchronization you need for uas_stat_cmplt().

> I think the race condition you saw would go something like this:
> 
>  - SCSI/block layer attempts to cancel a command for a USB 2.0 UAS
>    device, and uas_eh_abort_handler() gets past issuing the abort task,
>    and killing any anchored URBs.
>  - In interrupt, the status command completes, submits the new data and
>    status URBs, and returns.
>  - uas_eh_abort_handler() returns and lets the SCSI core know the task
>    was successfully aborted.
>  - Later the two URBs return, and we attempt to complete a command the
>    SCSI/block layer already returned as canceled.

More or less.  There are other similar races, such as between sending
the data & status URBs vs. sending ABORT-TASK URBs.  The real question
is when you decide that a command has completed, and I don't remember
the details of the protocol well enough to say much more than that.

Anyway, if you fix these races you'll find that you have taken care of
the pre-reset synchronization as well.

> I think that could be worked around by adding an atomic_t or using
> set_bit for a flag in the cmnd_info structure.  I had noticed the rather
> long comment above usb_stor_blocking_completion() in usb-storage, and
> had intended to do something similar with the UAS driver.  If you have
> any advice about how to work around this issue, it would be appreciated.

To tell the truth, usb-storage isn't a great model here.  It works, but
I did most of my work on it before I was fully aware of all the issues
involved in synchronizing multiple threads of control.  I wouldn't be
at all suprised to find there are places where usb-storage is missing
memory barriers or something of the sort.  Also, usb-storage handles
only one command at a time, so it doesn't have to face the same
complexity as UAS.

Since aborts are uncommon and you don't want to clutter up the fast
path, it makes sense to use some form of RCU to synchronize aborts with
uas_stat_cmplt().  Once that's done, the pre-reset routine merely has
to abort all the commands currently in flight; the abort routine will
take care of the synchronization.

> > What happens when you decide to cancel a command that has some URBs 
> > sitting on the workqueue's queue?  How do you prevent the workqueue 
> > from submitting them?
> 
> We take the workqueue lock and remove them from the work queue.  If the
> workqueue has already re-submitted the URBs and removed them from its
> own workqueue, then we'll just wait on them with
> usb_kill_anchored_urbs().
> 
> But it's not so simple in the reset case.  We want to stop the work
> queue completely and kill any pending URBs, so we have to sync with all
> the functions that might schedule the work queue, and prevent any new
> URBs from being submitted.

But the workqueue gets scheduled from only a few places, right?  So
once you have prevented new commands from reaching those places
(primarily the queuecommand routine -- I don't think commands can come
from anywhere else), then all the workqueue needs is the usual abort
processing.

In other words, once the synchronization for abort is done properly, it
turns out that pre-reset needs only to synchronize with queuecommand.  
And that requires just the host lock.

> > >  It also isn't taken in the
> > > status phase handler (uas_stat_cmplt) that resubmits the second pair of
> > > URBs for USB 2.0 UAS devices.  I'm not sure if the SCSI host lock can be
> > > taken in interrupt context?
> > 
> > It can.  However, from the comments above, it's not clear that 
> > uas_stat_cmplt needs to do anything special.
> 
> It can schedule the workqueue, which is why it needs to sync with
> uas_prereset.

No, as explained above, uas_stat_cmplt needs to synchronize with the
abort handler.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html