Re: [PATCH 0/2] Fix the NEC stop bug workaround

Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> · Tue, 15 Oct 2024 15:23:23 +0300

On 14.10.2024 22.08, Michal Pecio wrote:
> Hi,
>
> I found an unfortunate problem with my workaround for this hardware bug.
>
> To recap, Stop Endpoint sometimes fails, the Endpoint Context says the
> EP is Stopped, but cancelled TRBs are still executed. I found this bug
> earlier this year and submitted a workaround, which retries the command
> (sometimes a few times) and all is good.
>
> This works fine for common cases, but what if the endpoint is really
> stopped? Then Stop Endpoint is supposed to fail and fail it does. The
> workaround code doesn't know that it happened and retries infinitely.
>
> I have never seen it in normal use, but I devised a reliable repro.
> The effect isn't pretty - no URBs can be cancelled, device gets stuck,
> if unplugged it locks up connections/disconnections on the whole bus.
>
> With some experimentation I found that the bug is a variant of the old
> "stop after restart" issue - the doorbell ring is internally reordered
> after the subsequent command. By busy-waiting I confirmed that EP state
> which is initially seen as Stopped becomes Running some time later.

Seems host controllers aren't designed to stop, move dequeue, and restart
an endpoint in quick succession.
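
To make the two cases concrete, here is a rough userspace toy model of the
reordering described above. Everything in it (the toy_* names, the tick-based
delay) is made up for illustration. It is not xhci driver code and not what
the patches do. It only shows that a failed Stop Endpoint with a Stopped
context is ambiguous until some time has passed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
enum toy_ep_state { TOY_EP_STOPPED, TOY_EP_RUNNING };

struct toy_hc {
	enum toy_ep_state state;
	int doorbell_delay;	/* ticks until a rung doorbell takes effect, -1 = none pending */
};

/* Ring the doorbell; the (assumed) controller applies it only after 'delay' ticks. */
static void toy_ring_doorbell(struct toy_hc *hc, int delay)
{
	assert(hc != NULL);
	hc->doorbell_delay = delay;
}

/* Advance controller time by one tick. */
static void toy_tick(struct toy_hc *hc)
{
	if (hc->doorbell_delay < 0)
		return;
	if (hc->doorbell_delay == 0) {
		hc->state = TOY_EP_RUNNING;
		hc->doorbell_delay = -1;
		return;
	}
	hc->doorbell_delay--;
}

/* Stop Endpoint: succeeds only if the endpoint is Running when processed. */
static bool toy_stop_endpoint(struct toy_hc *hc)
{
	toy_tick(hc);	/* processing the command takes one tick */
	if (hc->state != TOY_EP_RUNNING)
		return false;	/* looks like "already Stopped" */
	hc->state = TOY_EP_STOPPED;
	return true;
}

/* Busy-wait for at most max_ticks; true if the endpoint became Running. */
static bool toy_wait_running(struct toy_hc *hc, int max_ticks)
{
	for (int i = 0; i < max_ticks; i++) {
		if (hc->state == TOY_EP_RUNNING)
			return true;
		toy_tick(hc);
	}
	return hc->state == TOY_EP_RUNNING;
}
```

With a doorbell still pending, toy_stop_endpoint() fails first, then
toy_wait_running() returns true and a retry succeeds. With no doorbell pending
it fails the same way, but the endpoint never becomes Running, so retrying
without a bound would loop forever.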

In addition to fixing this NEC case, we could think about avoiding these
situations in the first place. Some could be avoided by adding a new
".flush_endpoint()" callback to the USB host side API. usbcore itself has a
usb_hcd_flush_endpoint() function that calls .urb_dequeue() in a loop for each
queued URB, causing the host to issue the stop, move the dequeue pointer and
ring the doorbell for every URB.

If usbcore knows all URBs will be cancelled, it could let the host do it in
one go, i.e. stop the endpoint once.
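
A rough sketch of the difference, as a userspace toy model that only counts
commands. The toy_* names are invented for this mail, and the assumption that
every .urb_dequeue() costs a full stop / move deq / doorbell cycle is the
simplification described above, not a measurement of the real driver.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
struct toy_ep {
	size_t nr_queued;		/* URBs still queued on the endpoint */
	unsigned int stop_cmds;		/* Stop Endpoint commands issued */
	unsigned int set_deq_cmds;	/* Set TR Dequeue Pointer commands issued */
	unsigned int doorbells;		/* doorbell rings to restart the endpoint */
};

/* One cancel cycle: stop the endpoint, move the dequeue pointer, restart. */
static void toy_cancel_cycle(struct toy_ep *ep)
{
	ep->stop_cmds++;
	ep->set_deq_cmds++;
	ep->doorbells++;
}

/* Models the per-URB loop: one (assumed) full cycle for each queued URB. */
static void toy_flush_per_urb(struct toy_ep *ep)
{
	assert(ep != NULL);
	while (ep->nr_queued > 0) {
		ep->nr_queued--;
		toy_cancel_cycle(ep);
	}
}

/* Models a hypothetical .flush_endpoint(): cancel everything in one cycle. */
static void toy_flush_once(struct toy_ep *ep)
{
	assert(ep != NULL);
	if (ep->nr_queued == 0)
		return;		/* nothing queued, nothing to stop */
	ep->nr_queued = 0;
	toy_cancel_cycle(ep);
}
```

For an endpoint with four queued URBs, the per-URB variant ends with
stop_cmds == 4 and the one-shot variant with stop_cmds == 1, so there are
fewer stop/restart sequences in quick succession for the controller to get
wrong.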

Thanks
Mathias