Re: How are halted endpoints supposed to be handled in Linux?

Michał Pecio <michal.pecio@xxxxxxxxx> · Thu, 21 Nov 2024 11:26:53 +0100

Hi Alan,

Thank you for taking the time to answer. I'm beginning to regret not
asking this question earlier.

On Wed, 20 Nov 2024 21:31:29 -0500, Alan Stern wrote:
> > > Linux appears to ignore this part and only reset on STALL
> > > handshake, as advised in
> > > Documentation/driver-api/usb/error-codes.rst and practiced by
> > > drivers - they don't seem to bother with usb_clear_halt() on
> > > -EPROTO.  
> 
> This is generally a weakness in the drivers.  It would be nice if the 
> class specifications said what to do in these situations, but most of 
> them don't.

There is also proprietary hardware with no specification at all.

> > > On the HCD side, xHCI will:
> > > - give back the current URB with -EPROTO/-EPIPE status
> > > - reset the host side endpoint, clearing its toggle state
> > > - point the HC at the next URB if one exist
> > > - restart the endpoint without waiting for hcd->endpoint_reset()
> > > - ignore one subsequent call to hcd->endpoint_reset()  
> 
> This behavior is not correct.  See the kerneldoc for 
> usb_unlink_urb() in drivers/usb/core/urb.c, especially the section 
> labelled "Unlinking and Endpoint Queues".

OK, let's go through it.

 * But when an URB terminates with an
 * error its queue generally stops (see below), at least until that URB's
 * completion routine returns.

I don't see this working after xHCI adopted bottom half giveback, which
is asynchronous. As you are the maintainer of EHCI, which also uses BH,
how is EHCI dealing with it?

One way I see with existing APIs is to wait until the driver submits a
new URB, but are drivers prepared for this? Is EHCI doing the same?

 * It is guaranteed that a stopped queue
 * will not restart until all its unlinked URBs have been fully retired,
 * with their completion routines run

I think xHCI can currently guarantee that nothing is restarted until
any URB unlinked for any reason is given to the BH worker, but that's
all we have, and I just broke it in usb-next:

+               /* Try to restart the endpoint if all is done */
+               ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
+               /* Start giving back any TDs invalidated above */
+               xhci_giveback_invalidated_tds(ep);

This is part of a legitimate bugfix patch tagged for stable. I should
have insisted on keeping it a separate change, but it seemed a good idea
at the time which would soon get implemented anyway... Maybe no more?

 * even if that's not until some time
 * after the original completion handler returns.

Not entirely sure what this means.

 * The same behavior and
 * guarantee apply when an URB terminates because it was unlinked.

Same caveat about BH asynchronicity in xHCI.

> In general, the only safe thing for class drivers to do when they get 
> one of these errors on a bulk or interrupt endpoint is to unlink all
> the pending URBs for the endpoint and then call usb_clear_halt() when
> they have all completed.  I know that usb-storage does this; I
> suspect that not many other drivers do.

Your suspicion isn't wrong AFAIK.

One more thing which is safe at least wrt data corruption is to simply
retry the same URB without resetting anything. But if an HCD wants to
do it, existing API gives no way to notify the driver and allow it to
opt out and handle things differently, so it mustn't retry forever.

> I suppose the USB core could take care of this automatically so that 
> neither the class drivers nor the HCDs would have to worry about it.
> If everyone agrees that this wouldn't lead to other problems, it
> could be implemented without too much difficulty.

This still appears to run into the double delivery problem that you
described in the discussion linked by Thinh Nguyen, particularly in
case of dodgy drivers for dodgy hardware.

Considering that, I'm not sure if automatically resetting anything on
-EPROTO is a good idea.

> > > I wonder what other HCDs are doing in this case, and what's the
> > > idea behind it all?  
> 
> As far as I know, they don't automatically send Clear-Halt requests
> to the device or automatically unlink anything.

That's what it looks like, grepping through drivers/usb/host. But my
question was mainly about -EPROTO. When should an HCD restart a halted
endpoint? Should it clear the sequence state? (probably not).

Regards,
Michal