Re: How are halted endpoints supposed to be handled in Linux?

Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> · Thu, 21 Nov 2024 16:08:15 +0200

On 21.11.2024 12.26, Michał Pecio wrote:
Hi Alan,

Thank you for taking the time to answer. I'm beginning to regret not
asking this question earlier.

On Wed, 20 Nov 2024 21:31:29 -0500, Alan Stern wrote:
Linux appears to ignore this part and only reset on STALL
handshake, as advised in
Documentation/driver-api/usb/error-codes.rst and practiced by
drivers - they don't seem to bother with usb_clear_halt() on
-EPROTO.

For xhci I assumed that the device endpoint is halted when we receive
a  'Stall Error' transfer event for bulk or interrupt transfers.

Other errors such as 'Transaction Error' do halt the host side,
but does not necessarily mean whole endpoint is halted. This is based
on the info in xhci 4.6.8.1 "Soft Retry" :

"xHC shall halt an endpoint with a USB Transaction Error after CErr
retries have been performed. The USB device is not aware that the xHC
has halted the endpoint, and will be waiting for another retry, so a
Soft Retry may be used to perform additional retries and recover from
an error which has caused the xHC to halt an endpoint."

This is generally a weakness in the drivers.  It would be nice if the
class specifications said what to do in these situations, but most of
them don't.

There is also proprietary hardware with no specification at all.

On the HCD side, xHCI will:
- give back the current URB with -EPROTO/-EPIPE status
- reset the host side endpoint, clearing its toggle state
- point the HC at the next URB if one exist
- restart the endpoint without waiting for hcd->endpoint_reset()

Intention was not to restart the endpoint, but turns out we end up doing it.

Basically we should not ring the doorbell  when 'Set TR Deq' command completes
for a bulk or interrupt endpoint in case the command was queued to resolve a
Stall Error. For control endpoint we should restart the ring (xHC halts
the internal control endpoint on errors/stall)

- ignore one subsequent call to hcd->endpoint_reset()

This behavior is not correct.  See the kerneldoc for
usb_unlink_urb() in drivers/usb/core/urb.c, especially the section
labelled "Unlinking and Endpoint Queues".

OK, let's go through it.

  * But when an URB terminates with an
  * error its queue generally stops (see below), at least until that URB's
  * completion routine returns.

I don't see this working after xHCI adopted bottom half giveback, which
is asynchronous. As you are the maintainer of EHCI, which also uses BH,
how is EHCI dealing with it?

One way I see with existing APIs is to wait until the driver submits a
new URB, but are drivers prepared for this? Is EHCI doing the same?

Using a BH also means class driver may queue a new URB to xhci after xhci has
cleared its internal endpoint halt condition, but before class driver is
aware that endpoint halted.

  * It is guaranteed that a stopped queue
  * will not restart until all its unlinked URBs have been fully retired,
  * with their completion routines run

I think xHCI can currently guarantee that nothing is restarted until
any URB unlinked for any reason is given to the BH worker, but that's
all we have, and I just broke it in usb-next:

+               /* Try to restart the endpoint if all is done */
+               ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
+               /* Start giving back any TDs invalidated above */
+               xhci_giveback_invalidated_tds(ep);

This is part of a legitimate bugfix patch tagged for stable. I should
have insisted on keeping it a separate change, but it seemed a good idea
at the time which would soon get implemented anyway... Maybe no more?

I don't think your patch makes the existing issue worse in any significant
way.

Thanks
Mathias