Re: [RFC 8/9] UAS: Cancel pending URBs and free streams in pre-reset.

Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx> · Mon, 5 Dec 2011 12:18:26 -0800

On Sat, Dec 03, 2011 at 10:54:22AM -0500, Alan Stern wrote:
> On Fri, 2 Dec 2011, Sarah Sharp wrote:
> 
> > > Bear in mind that pre_reset isn't necessarily trying to flush any 
> > > commands.  It just gets called from time to time.  Maybe because the 
> > > device is stuck and needs to be reset; maybe because some other driver 
> > > for a different interface on the same device needs to reset it.
> > >
> > > You do need to decide what to do about commands that are in flight.  
> > > Ideally you would wait until they all complete by themselves (or are
> > > aborted by the error handler, which amounts to the same thing).  
> > > Actively cancelling their URBs in preparation for the reset is not as
> > > safe, although it is necessary when things time out or aborts fail.
> > 
> > It's true that another interface driver could want to reset the USB
> > device, not just the UAS driver.  But is there a way to tell the
> > difference without adding code to the UAS driver?
> 
> There _is_ no real difference.  For example, it's entirely possible
> that another interface's driver could decide to reset the device at the
> same time as the UAS driver.
> 
> On the other hand, it's not a big deal to add a single flag meaning "I 
> have decided to reset the device".

Ok, sure.  So your suggestion would be to check that flag, and if it's
not set, attempt to wait for the commands to flush themselves out before
canceling and killing them?

> > If the UAS driver is trying to reset the device, then it probably makes
> > sense to flush/kill everything rather than simply wait for it to
> > complete.  At that point, the device will have already refused to
> > respond to an abort task IU for a queued SCSI command.  The devices I've
> > seen that need a reset just seem wedged, and no other commands complete
> > on other stream IDs.
> > 
> > I'm also not sure we can just wait for the commands to complete.  Since
> > the device chooses which streams to request data for (and thus which
> > SCSI commands to service), it's entirely possible the device could just
> > starve a particular command indefinitely due to some fireware bug.  So
> > another driver asks for a reset, we let the commands complete on their
> > own (while not submitting new commands), the starved SCSI command
> > eventually times out and the abort task fails, and then the UAS driver
> > asks for a reset.  What happens then?
> 
> In that case the other driver would have to wait for the timeout.  You
> could improve the situation a little when this occurs by giving the
> existing commands only a short time in which to complete, after which
> you try to abort them.  The abort would also need to have a short
> timeout.

Ok, so for each anchor, I would call usb_wait_anchor_empty_timeout()
with a short timeout, then usb_kill_anchored_urbs()?

BTW, how short is short?  There could be up to 256 commands queued and
we don't want the user to wait for, say, more than a second, so I don't
think the timeout should be bigger than about 3ms?

> > > > The solution is to introduce a new atomic_t, resetting.  When the
> > > 
> > > Why atomic_t?  Is the flag going to be written from multiple threads or
> > > contexts?
> > 
> > I think I used atomic_t because it has the memory barrier semantics I
> > need to make SRCU work.  I'll have to dig up my notes on that.
> 
> I don't think atomic_t values have any memory barrier semantics in 
> particular.  Some of the functions that operate on atomic_t's do have 
> memory barriers, though.
> 
> Don't the SRCU primitives already provide all the memory barriers you 
> need?  They are supposed to.

Ok, yes, I should use ACCESS_ONCE here to avoid the compiler reordering,
and it has nothing to do with memory barriers.  Sorry for the confusion.

> > > > resetting flag is set, uas_queuecommand_lck() should tell the SCSI layer
> > > > that the device is busy for any new commands, uas_stat_cmplt() will tell
> > > > the SCSI layer the command aborted without submitting the data URBs, and
> > > 
> > > This is confusing.  What you want to do is plug your queue.  That is, 
> > > stop sending out new commands, while waiting for the existing ones to 
> > > complete.  Is that what you meant?  (I'm not sure exactly what 
> > > uas_stat_cmplt's job is.)
> > 
> > With USB 2.0 devices, we submit one pair of URBs (the command phase and
> > status phase), and then submit a second pair of URBs (the data phase and
> > status phase) in the first status phase's URB completion handler
> > (uas_stat_cmplt).  If the device is about to be reset because the UAS
> > driver requested it, I'm not sure it makes sense to submit the second
> > round of URBs.  I suppose we could let them go through if a different
> > driver was requesting the reset, but we'd need a flag or something
> > similar to signal the UAS driver was the requester.
> 
> I guess this depends on your philosophy: When a reset is coming up, do
> you want to allow as many outstanding commands as possible to complete
> normally?  Or do you prefer to abort them all immediately?

Well, by the time we get to a reset (one issued by the UAS driver, not
another driver), the device has already refused to take a particular
command out of its queue by either not answering or returning a failing
status for the ABORT TASK IU.  So at that point we know the device has
something wrong with its SCSI queue handling.  Do we trust it with other
commands?  I'm not sure, but I lean towards aborting all outstanding
commands rather than letting them finish.

> If you do decide to cancel them, then yes, uas_stat_cmplt requires some
> sort of synchronization.  But it has to be synchronized with the
> cancellation operation, not with the reset.

uas_stat_cmplt() needs to be synchronized with the reset operation
because it can cause the work queue to be scheduled.  uas_prereset needs
to ensure that work queue is completely stopped before it attempts to
manipulate the workqueue and kill any partially queued URBs.

> For example, what would happen if the driver were asked to cancel a
> command just as uas_stat_cmplt was called?

Ok, you're correct that there is a problem there, but I think it's a
separate issue from whether we need SRCU for the work queue sync.

I think the race condition you saw would go something like this:

 - SCSI/block layer attempts to cancel a command for a USB 2.0 UAS
   device, and uas_eh_abort_handler() gets past issuing the abort task,
   and killing any anchored URBs.
 - In interrupt, the status command completes, submits the new data and
   status URBs, and returns.
 - uas_eh_abort_handler() returns and lets the SCSI core know the task
   was successfully aborted.
 - Later the two URBs return, and we attempt to complete a command the
   SCSI/block layer already returned as canceled.

I think that could be worked around by adding an atomic_t or using
set_bit for a flag in the cmnd_info structure.  I had noticed the rather
long comment above usb_stor_blocking_completion() in usb-storage, and
had intended to do something similar with the UAS driver.  If you have
any advice about how to work around this issue, it would be appreciated.

> > > > We use SRCU to synchronize between the readers of the resetting flag
> > > > (uas_queuecommand_lck, uas_xfer_data, and uas_eh_abort_handler) and the
> > > > writers of the flag (uas_pre_reset and uas_post_reset).  We have to use
> > > > "sleepable" RCU instead of RCU because the readers must be able to sleep
> > > > while waiting for URBs to be killed, while holding the reader lock.
> > > 
> > > It's not clear that this is the best approach.  You already own the
> > > host lock in uas_queuecommand_lck; why not use that for
> > > synchronization?
> > 
> > The host lock isn't taken in the UAS workqueue (which attempts to
> > resubmit URBs for SCSI commands that initially failed due to out of
> > memory or errors from usb_submit_urb).
> 
> Yeah, okay, that sounds like it may be a reasonable use for SRCU.
> 
> What happens when you decide to cancel a command that has some URBs 
> sitting on the workqueue's queue?  How do you prevent the workqueue 
> from submitting them?

We take the workqueue lock and remove them from the work queue.  If the
workqueue has already re-submitted the URBs and removed them from its
own workqueue, then we'll just wait on them with
usb_kill_anchored_urbs().

But it's not so simple in the reset case.  We want to stop the work
queue completely and kill any pending URBs, so we have to sync with all
the functions that might schedule the work queue, and prevent any new
URBs from being submitted.

> >  It also isn't taken in the
> > status phase handler (uas_stat_cmplt) that resubmits the second pair of
> > URBs for USB 2.0 UAS devices.  I'm not sure if the SCSI host lock can be
> > taken in interrupt context?
> 
> It can.  However, from the comments above, it's not clear that 
> uas_stat_cmplt needs to do anything special.

It can schedule the workqueue, which is why it needs to sync with
uas_prereset.

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html