Re: How are halted endpoints supposed to be handled in Linux?

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Wed, 20 Nov 2024 21:31:29 -0500

On Thu, Nov 21, 2024 at 12:02:19AM +0000, Thinh Nguyen wrote:
> ++Alan
> 
> Hi,
> 
> On Thu, Nov 21, 2024, Michał Pecio wrote:
> > Hi,
> > 
> > I have been wondering about it after seeing how it's done in xhci_hcd,
> > which looks wrong to me.
> > 
> > USB 2.0 spec 5.7.5/5.8.5 states that halt condition due to either STALL
> > handshake or "transmission error" should cause both the device and host
> > endpoints to be reset. I presume "transmission error" means any error
> > detected by the HC which causes it to halt, various examples exist.

The spec is unfortunately vague about what a halt condition is, although 
it apparently intends the various sorts of transmission errors to count.

> > USB 3.0 just refers to USB 2.0.
> > 
> > Linux appears to ignore this part and only reset on STALL handshake, as
> > advised in Documentation/driver-api/usb/error-codes.rst and practiced
> > by drivers - they don't seem to bother with usb_clear_halt() on -EPROTO.

This is generally a weakness in the drivers.  It would be nice if the 
class specifications said what to do in these situations, but most of 
them don't.

> > This wouldn't necessarily be bad in itself, but:
> > 
> > On the HCD side, xHCI will:
> > - give back the current URB with -EPROTO/-EPIPE status
> > - reset the host side endpoint, clearing its toggle state
> > - point the HC at the next URB if one exist
> > - restart the endpoint without waiting for hcd->endpoint_reset()
> > - ignore one subsequent call to hcd->endpoint_reset()

This behavior is not correct.  See the kerneldoc for 
usb_unlink_urb() in drivers/usb/core/urb.c, especially the section 
labelled "Unlinking and Endpoint Queues".

In general, the only safe thing for class drivers to do when they get 
one of these errors on a bulk or interrupt endpoint is to unlink all the 
pending URBs for the endpoint and then call usb_clear_halt() when they 
have all completed.  I know that usb-storage does this; I suspect that 
not many other drivers do.

I suppose the USB core could take care of this automatically so that 
neither the class drivers nor the HCDs would have to worry about it.  If 
everyone agrees that this wouldn't lead to other problems, it could be 
implemented without too much difficulty.

However, as you point out, in general there's no way for the host to 
know whether the device accepted the failed transaction.  This means 
that proper recovery requires knowledge of the higher-level protocol, 
which only the class driver is aware of.

> > For STALL, I think it's a little awkward, but acceptable. The ultimate
> > result appears to be that all pending URBs are given back with -EPIPE
> > and things start moving again after usb_clear_halt().
> > 
> > But if the device isn't stalled, the next URB may execute right away if
> > the failure was transient. This makes it impossible to ensure in-order
> > delivery on bulk OUT pipes, because one URB is skipped and the driver
> > has no reliable way to retry it before some later ones may get executed.

Just so.

> The class driver will know of this error, and the retry/recovery should
> be done at the class driver base on this error and not from the HCD.

Agreed.  The minimum requirement is that the endpoint does not restart 
until all the pending URBs have completed, whether through explicit or 
automatic unlinks.

> > This behavior also creates an opportunity for toggle mismatch, and as
> > far as I understand, the hardware will resolve it by silently dropping
> > one packet. Another could be dropped if usb_clear_halt() were called.
> > 
> > 
> > Either I'm missing something, or it seems quite broken?
> > 
> > I wonder what other HCDs are doing in this case, and what's the idea
> > behind it all?

As far as I know, they don't automatically send Clear-Halt requests to 
the device or automatically unlink anything.

Alan Stern