Re: How are halted endpoints supposed to be handled in Linux?

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Thu, 21 Nov 2024 10:06:50 -0500

On Thu, Nov 21, 2024 at 11:26:53AM +0100, Michał Pecio wrote:
> Hi Alan,
> 
> Thank you for taking the time to answer. I'm beginning to regret not
> asking this question earlier.

I hope all the material here is accurate; it's been a long time since I 
worked on these matters.

> On Wed, 20 Nov 2024 21:31:29 -0500, Alan Stern wrote:
> > > > Linux appears to ignore this part and only reset on STALL
> > > > handshake, as advised in
> > > > Documentation/driver-api/usb/error-codes.rst and practiced by
> > > > drivers - they don't seem to bother with usb_clear_halt() on
> > > > -EPROTO.  
> > 
> > This is generally a weakness in the drivers.  It would be nice if the 
> > class specifications said what to do in these situations, but most of 
> > them don't.
> 
> There is also proprietary hardware with no specification at all.

Indeed.

> > > > On the HCD side, xHCI will:
> > > > - give back the current URB with -EPROTO/-EPIPE status
> > > > - reset the host side endpoint, clearing its toggle state
> > > > - point the HC at the next URB if one exist
> > > > - restart the endpoint without waiting for hcd->endpoint_reset()
> > > > - ignore one subsequent call to hcd->endpoint_reset()  
> > 
> > This behavior is not correct.  See the kerneldoc for 
> > usb_unlink_urb() in drivers/usb/core/urb.c, especially the section 
> > labelled "Unlinking and Endpoint Queues".
> 
> OK, let's go through it.
> 
>  * But when an URB terminates with an
>  * error its queue generally stops (see below), at least until that URB's
>  * completion routine returns.
> 
> I don't see this working after xHCI adopted bottom half giveback, which
> is asynchronous. As you are the maintainer of EHCI, which also uses BH,
> how is EHCI dealing with it?

Yes, I am.

For Bulk endpoints: When a transmission error occurs, ehci-hcd removes 
the endpoint's queue header from the controller's async list, in 
addition to giving back the failed URB.  When the removal is complete, 
the queue is scanned for other unlinked URBs, and they are given back.  
After that happens, the next URB submission will cause the queue header 
to be put back on the controller's async list.

You're right about the lack of synchronization caused by use of a BH.  
The HCD has no way to know when the class driver has finished processing 
a giveback.  We do need to fix this.

> One way I see with existing APIs is to wait until the driver submits a
> new URB, but are drivers prepared for this? Is EHCI doing the same?

Yes; see above.

>  * It is guaranteed that a stopped queue
>  * will not restart until all its unlinked URBs have been fully retired,
>  * with their completion routines run
> 
> I think xHCI can currently guarantee that nothing is restarted until
> any URB unlinked for any reason is given to the BH worker, but that's
> all we have, and I just broke it in usb-next:
> 
> +               /* Try to restart the endpoint if all is done */
> +               ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
> +               /* Start giving back any TDs invalidated above */
> +               xhci_giveback_invalidated_tds(ep);
> 
> This is part of a legitimate bugfix patch tagged for stable. I should
> have insisted on keeping it a separate change, but it seemed a good idea
> at the time which would soon get implemented anyway... Maybe no more?
> 
>  * even if that's not until some time
>  * after the original completion handler returns.
> 
> Not entirely sure what this means.

Suppose URBs 1, 2, and 3 are queued when URB 1 gets a transmission 
error.  It is given back, and its completion handler unlinks URBs 2 and 
3.  The completion handler for URB 1 then returns, but the queue won't 
restart until the completion handlers for URBs 2 and 3 have returned, 
even if that doesn't occur for some time.  At least, that is the intent.

>  * The same behavior and
>  * guarantee apply when an URB terminates because it was unlinked.
> 
> Same caveat about BH asynchronicity in xHCI.
> 
> > In general, the only safe thing for class drivers to do when they get 
> > one of these errors on a bulk or interrupt endpoint is to unlink all
> > the pending URBs for the endpoint and then call usb_clear_halt() when
> > they have all completed.  I know that usb-storage does this; I
> > suspect that not many other drivers do.
> 
> Your suspicion isn't wrong AFAIK.
> 
> One more thing which is safe at least wrt data corruption is to simply
> retry the same URB without resetting anything. But if an HCD wants to
> do it, existing API gives no way to notify the driver and allow it to
> opt out and handle things differently, so it mustn't retry forever.

Yes, and in fact ehci-hcd _does_ retry up to 32 times.

> > I suppose the USB core could take care of this automatically so that 
> > neither the class drivers nor the HCDs would have to worry about it.
> > If everyone agrees that this wouldn't lead to other problems, it
> > could be implemented without too much difficulty.
> 
> This still appears to run into the double delivery problem that you
> described in the discussion linked by Thinh Nguyen, particularly in
> case of dodgy drivers for dodgy hardware.
> 
> Considering that, I'm not sure if automatically resetting anything on
> -EPROTO is a good idea.

What about automatic unlinking?

Note that some class drivers treat -EPROTO as a fatal error.  That is, 
they don't retry and their completion-resubmission loop breaks down.  
When that happens, the only way forward is to unbind the driver (for 
example, by unplugging the device).  Of course, this isn't a problem if 
the original cause of the -EPROTO error is that the device just _was_ 
unplugged.  All other cases are sufficiently rare that we don't have a 
generally agreed-upon strategy for handling them.

> > > > I wonder what other HCDs are doing in this case, and what's the
> > > > idea behind it all?  
> > 
> > As far as I know, they don't automatically send Clear-Halt requests
> > to the device or automatically unlink anything.
> 
> That's what it looks like, grepping through drivers/usb/host. But my
> question was mainly about -EPROTO. When should an HCD restart a halted
> endpoint? Should it clear the sequence state? (probably not).

Except for an immediate retry, I suspect the HCD should never restart a 
halted Bulk or Interrupt endpoint on its own initiative.  Not until 
another URB is submitted.

However, this seems impractical if the class driver wants to retain the 
existing URBs already on the endpoint's queue.  (I don't know of any 
drivers that do this, but still...)  Perhaps we should adopt the policy 
that -EPROTO, -EILSEQ, and -ETIME cause all outstanding URBs to fail and 
enforce this policy in usbcore by automatic unlinking so that HC 
drivers don't have to do it.

Alan Stern