On 14.10.2024 22.08, Michal Pecio wrote:
> Hi,
>
> I found an unfortunate problem with my workaround for this hardware bug.
>
> To recap, Stop Endpoint sometimes fails, the Endpoint Context says the EP is Stopped, but cancelled TRBs are still executed. I found this bug earlier this year and submitted a workaround, which retries the command (sometimes a few times) and all is good.
>
> This works fine for common cases, but what if the endpoint is really stopped? Then Stop Endpoint is supposed to fail and fail it does. The workaround code doesn't know that it happened and retries infinitely.
>
> I have never seen it in normal use, but I devised a reliable repro. The effect isn't pretty - no URBs can be cancelled, device gets stuck, if unplugged it locks up connections/disconnections on the whole bus.
>
> With some experimentation I found that the bug is a variant of the old "stop after restart" issue - the doorbell ring is internally reordered after the subsequent command. By busy-waiting I confirmed that EP state which is initially seen as Stopped becomes Running some time later.
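To make the two cases above concrete, here is a rough standalone model, not xhci-hcd code. Everything in it (struct fake_ep, fake_stop_endpoint(), stop_with_bounded_retry(), the pending_ring counter) is made up for illustration. It assumes the reordered doorbell ring takes effect after a fixed number of failed commands, and it shows a bounded retry only as one possible way to tell "restart still pending" apart from "really stopped". It is not necessarily how the driver does or should handle it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an endpoint, not a real xHCI structure. */
enum ep_state { EP_STOPPED, EP_RUNNING };

struct fake_ep {
	enum ep_state state;	/* what the Endpoint Context reports */
	int pending_ring;	/* failed commands left before a delayed
				 * doorbell ring takes effect, -1 if none */
};

/*
 * Models Stop Endpoint: it fails (context state error) while the EP reads
 * as Stopped and succeeds once the EP is Running.
 */
bool fake_stop_endpoint(struct fake_ep *ep)
{
	if (ep->pending_ring == 0) {
		/* The reordered doorbell ring finally lands. */
		ep->state = EP_RUNNING;
		ep->pending_ring = -1;
	} else if (ep->pending_ring > 0) {
		ep->pending_ring--;
	}

	if (ep->state == EP_STOPPED)
		return false;

	ep->state = EP_STOPPED;
	return true;
}

/*
 * Retry the command, but give up after max_attempts so that a really
 * stopped endpoint cannot cause an endless loop.
 * Returns the number of attempts used, or -1 if the limit was hit.
 */
int stop_with_bounded_retry(struct fake_ep *ep, int max_attempts)
{
	assert(max_attempts > 0);

	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		if (fake_stop_endpoint(ep))
			return attempt;
	}
	return -1;
}
```

With pending_ring = 2 the command fails twice and succeeds on the third attempt. With pending_ring = -1 and the EP Stopped, every attempt fails, which is where an unbounded retry would spin forever.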
It seems host controllers aren't designed to stop, move dequeue, and restart an endpoint in quick succession.

In addition to fixing this NEC case, we could think about avoiding these cases. Some could be avoided by adding a new ".flush_endpoint()" callback to the USB host side API.

Usbcore itself has a usb_hcd_flush_endpoint() function that calls .urb_dequeue() in a loop for each queued URB, causing the host to issue the stop, move deq and ring doorbell for every URB.

If usbcore knows all URBs will be cancelled, it could let the host do it in one go, i.e. stop the endpoint once.

Thanks
Mathias
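
A rough standalone sketch of the flush idea, only to show the difference in the number of stop/restart cycles. This is not kernel code: struct fake_hcd, struct fake_hc_driver and all fake_*() functions are hypothetical, and only the callback names urb_dequeue and flush_endpoint are taken from the text above. It assumes a driver-side flush needs a single stop cycle no matter how many URBs are queued.

```c
#include <assert.h>
#include <stddef.h>

struct fake_hcd;

/* Hypothetical driver ops table, loosely modelled on the callbacks named above. */
struct fake_hc_driver {
	void (*urb_dequeue)(struct fake_hcd *hcd);	/* cancel one URB */
	void (*flush_endpoint)(struct fake_hcd *hcd);	/* optional: cancel all */
};

struct fake_hcd {
	const struct fake_hc_driver *driver;
	int queued_urbs;
	int stop_cycles;	/* stop + move dequeue + restart sequences */
};

/* Per-URB cancel: one full stop cycle for every URB. */
void fake_urb_dequeue(struct fake_hcd *hcd)
{
	hcd->stop_cycles++;
	hcd->queued_urbs--;
}

/* Batched cancel: stop the endpoint once and drop everything queued. */
void fake_flush_endpoint(struct fake_hcd *hcd)
{
	hcd->stop_cycles++;
	hcd->queued_urbs = 0;
}

/* Core-side flush: use the batched callback if the driver provides one. */
void fake_flush(struct fake_hcd *hcd)
{
	assert(hcd->queued_urbs >= 0);

	if (hcd->driver->flush_endpoint) {
		hcd->driver->flush_endpoint(hcd);
		return;
	}

	/* Fallback: cancel the queued URBs one by one. */
	while (hcd->queued_urbs > 0)
		hcd->driver->urb_dequeue(hcd);
}
```

With 8 queued URBs, the fallback path goes through 8 stop cycles, while a driver that fills in flush_endpoint goes through one.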