On Thu, 21 Nov 2024 16:08:15 +0200, Mathias Nyman wrote: > "xHC shall halt an endpoint with a USB Transaction Error after CErr > retries have been performed. The USB device is not aware that the xHC > has halted the endpoint, and will be waiting for another retry, so a > Soft Retry may be used to perform additional retries and recover from > an error which has caused the xHC to halt an endpoint." That's why it would be "nice to have" the ability to allow class drivers to perform further "soft retries" by killing all URBs and submitting them (together with the failed one) again. Or by some other means, as this is a rather crude API. I imagine it would be the most sensible option for some devices like serial, although I don't know if any driver currently attempts it. If this is to work reliably, the sequence state apparently needs to be preserved on both sides and not reset, regardless of what the USB spec says. So it would take a TSP=1 reset on -EPROTO and full reset on -EPIPE (no other choice here), each followed by Set Deq. If xHCI wants or needs to keep doing full reset on -EPROTO then CLEAR_FEATURE(ENDPOINT_HALT) seems like a good addition. It does have the "double delivery after missing ACK" problem with class drivers unable or unwilling to determine the true state of their device, but avoids guaranteed toggle mismatch half the time. > >>>> On the HCD side, xHCI will: > >>>> - give back the current URB with -EPROTO/-EPIPE status > >>>> - reset the host side endpoint, clearing its toggle state > >>>> - point the HC at the next URB if one exist > >>>> - restart the endpoint without waiting for hcd->endpoint_reset() > > Intention was not to restart the endpoint, but turns out we end up > doing it. > > Basically we should not ring the doorbell when 'Set TR Deq' command > completes for a bulk or interrupt endpoint in case the command was > queued to resolve a Stall Error. According to Alan's kerneldoc, also for -EPROTO and ordinary unlink(). So practically always. Apparently, no endpoint stopped for any reason is supposed to restart with any complete() callbacks still pending. I imagine going back to direct giveback in cases of unlink/halt might be an option, maybe even more reliable than waiting for a new URB. > >>>> - ignore one subsequent call to hcd->endpoint_reset() It's also a bug that EP_HARD_CLEAR_TOGGLE isn't cleared when a new URB is queued on the endpoint. Toggle state becomes unknown then. > Using a BH also means class driver may queue a new URB to xhci after > xhci has cleared its internal endpoint halt condition, but before > class driver is aware that endpoint halted. In fact, they can do even more. Network drivers with a simple complete- resubmit loop can schedule a work to resubmit later if their GFP_ATOMIC allocation of a fresh data buffer fails. More generally, drivers are under no obligation to only resubmit URBs from complete() callbacks. I suppose the combination of BH giveback and restarting on a new URB is fundamentally unable to provide documented guarantees, on EHCI too. > + /* Try to restart the endpoint if all is done */ > + ring_doorbell_for_active_rings(xhci, slot_id, ep_index); > + /* Start giving back any TDs invalidated above */ > + xhci_giveback_invalidated_tds(ep); > > I don't think your patch makes the existing issue worse in any > significant way. Currently, it's just a sub-microsecond differenece in an existing race. But if we are to eliminate the race, this code needs reordering. Regards, Michal