Re: How are halted endpoints supposed to be handled in Linux?

Michał Pecio <michal.pecio@xxxxxxxxx> · Fri, 22 Nov 2024 12:54:01 +0100

On Thu, 21 Nov 2024 16:08:15 +0200, Mathias Nyman wrote:
> "xHC shall halt an endpoint with a USB Transaction Error after CErr
> retries have been performed. The USB device is not aware that the xHC
> has halted the endpoint, and will be waiting for another retry, so a
> Soft Retry may be used to perform additional retries and recover from
> an error which has caused the xHC to halt an endpoint."

That's why it would be "nice to have" the ability to allow class
drivers to perform further "soft retries" by killing all URBs and
submitting them (together with the failed one) again. Or by some
other means, as this is a rather crude API.

I imagine it would be the most sensible option for some devices like
serial, although I don't know if any driver currently attempts it.

If this is to work reliably, the sequence state apparently needs
to be preserved on both sides and not reset, regardless of what the
USB spec says. So it would take a TSP=1 reset on -EPROTO and full
reset on -EPIPE (no other choice here), each followed by Set Deq.

If xHCI wants or needs to keep doing full reset on -EPROTO then
CLEAR_FEATURE(ENDPOINT_HALT) seems like a good addition. It does
have the "double delivery after missing ACK" problem with class
drivers unable or unwilling to determine the true state of their
device, but avoids guaranteed toggle mismatch half the time.

> >>>> On the HCD side, xHCI will:
> >>>> - give back the current URB with -EPROTO/-EPIPE status
> >>>> - reset the host side endpoint, clearing its toggle state
> >>>> - point the HC at the next URB if one exist
> >>>> - restart the endpoint without waiting for hcd->endpoint_reset()
> 
> Intention was not to restart the endpoint, but turns out we end up
> doing it.
> 
> Basically we should not ring the doorbell  when 'Set TR Deq' command
> completes for a bulk or interrupt endpoint in case the command was
> queued to resolve a Stall Error.

According to Alan's kerneldoc, also for -EPROTO and ordinary unlink().
So practically always. Apparently, no endpoint stopped for any reason
is supposed to restart with any complete() callbacks still pending.

I imagine going back to direct giveback in cases of unlink/halt might
be an option, maybe even more reliable than waiting for a new URB. 

> >>>> - ignore one subsequent call to hcd->endpoint_reset()

It's also a bug that EP_HARD_CLEAR_TOGGLE isn't cleared when a new
URB is queued on the endpoint. Toggle state becomes unknown then.

> Using a BH also means class driver may queue a new URB to xhci after
> xhci has cleared its internal endpoint halt condition, but before
> class driver is aware that endpoint halted.

In fact, they can do even more. Network drivers with a simple complete-
resubmit loop can schedule a work to resubmit later if their GFP_ATOMIC
allocation of a fresh data buffer fails. More generally, drivers are
under no obligation to only resubmit URBs from complete() callbacks.

I suppose the combination of BH giveback and restarting on a new URB
is fundamentally unable to provide documented guarantees, on EHCI too.

> +               /* Try to restart the endpoint if all is done */
> +               ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
> +               /* Start giving back any TDs invalidated above */
> +               xhci_giveback_invalidated_tds(ep);
>
> I don't think your patch makes the existing issue worse in any
> significant way.

Currently, it's just a sub-microsecond differenece in an existing race.
But if we are to eliminate the race, this code needs reordering.

Regards,
Michal