You are correct; it seems like this issue arises from the fact that stalled no-op trbs are ignored and the controller remains stuck in the Halted state because the Reset Endpoint command never gets run. >From what I understand, this issue is not caused by losing track of where the endpoint ring stopped, but by a race condition between killing URBs and stall errors. As such, the STALL_ERROR handling happens right after the endpoint ring is stopped. Since the trb is no-op'ed in xhci_handle_cmd_stop_ep(), the endpoint recovery code never gets run. The cdc-acm driver triggers the "perpetually halted endpoint" behavior via (what I believe are) the following operations: 1. The cdc-acm driver submits and queues 16 bulk IN URBs. 2. A full-speed device times out and fails to respond to its IN token (including retries), causing the CSPLIT transaction with its upstream hub to return a STALL. The first URB thus completes with -EPIPE. 3. Upon completion of the first URB, the cdc-acm driver kills the other 15 URBs, attempts to clear the halt condition, and resubmits all 16 URBs. 4. Every time a URB is killed, the endpoint ring is restarted and URBs that are not killed yet are executed. If we are unlucky, the URB's TRB is gets cancelled right after it has been executed but before xhci_irq() gets to run, causing this condition. My patch is simply a workaround for this specific case; it may be worth considering if there are any other tasks that need to be done in the no-op trb case. Best, Gavin On Mon, May 29, 2017 at 7:06 AM, Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> wrote: > On 26.05.2017 22:12, gavinli@xxxxxxxxxxxxxx wrote: >> >> From: Gavin Li <git@xxxxxxxxxxxxxx> >> >> If a stalling TRB is cancelled and NOOP'ed in xhci_handle_cmd_stop_ep(), >> finish_td() never gets called to reset the halted endpoint and the >> endpoint remains indefinitely stalled. This patch ensures that >> xhci_cleanup_halted_endpoint() is called after a TRB completion if the >> endpoint is halted, irregardless of the status of any pending TRBs. >> >> Signed-off-by: Gavin Li <git@xxxxxxxxxxxxxx> > > > Interesting, I'm trying to understand the details of what's going on here. > > Does the event ring first have a stopped transfer event, then a stop > endpoint > command completion event, and finally a transfer event with STALL_ERROR > completion > code pointing to a no-op trb? > > If this is the case do yo have any idea if the STALL_ERROR transfer event > was issued while the ring was stopping, or only after ring was restarted > again? > > current code relies on the the stopped transfer event to know where on the > endpoint ring hardware stopped, sometimes this might not be reliable. > > I have a change for that, just pushed to my for-usb-next branch in my tree > at: > git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git > > Your patch basically adds a check for STALL_ERROR for no-op trbs, it reveals > that the driver currently skips handling all endpoint state related transfer > events if that TRB was canceled. This needs to investigated more > > Thanks > -Mathias > -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html