On 14.10.2024 22.10, Michal Pecio wrote:
The NEC uPD720200 has a bug, which prevents reliably stopping
an endpoint shortly after it has been restarted. This usually
happens when a driver kills many URBs in quick succession and
it results in concurrent execution and cancellation of TDs.
This is handled by stopping the endpoint again if in doubt.
This "doubt" turns out to be a problem, because Stop Endpoint
may be queued when the EP is already Stopped (for Set TR Deq
execution, for example) or becomes Stopped concurrently (by
Reset Endpoint, for example). If the EP is truly Stopped, the
command fails and further retries just keep failing forever.
This is easily triggered by modifying uvcvideo to unlink its
isochronous URBs in 100us intervals instead of poisoning them.
Any driver that unlinks URBs asynchronously may trigger this,
and any URB unlink during ongoing halt recovery also can.
Fix the problem by tracking redundant Stop Endpoint commands
which are sure to fail, and by not retrying them. It's easy,
because xhci_urb_dequeue() is the only user ever queuing the
command with the default handler and without ensuring that
the endpoint is Running and will not Halt before it Stops.
For this case, we assume that an endpoint with pending URBs
is always Running, unless certain operations are pending on
it which indicate known exceptions.
Note that we need to catch those exceptions when they occur,
because their flags may be cleared before our handler runs.
It's possible that other HCs have similar bugs (see also the
related "Running" case below), but the workaround is limited
to NEC because no such chips are currently known and tested.
Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers")
Signed-off-by: Michal Pecio <michal.pecio@xxxxxxxxx>
---
drivers/usb/host/xhci-ring.c | 44 +++++++++++++++++++++++++++++++++---
drivers/usb/host/xhci.h | 2 ++
2 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 4d664ba53fe9..c0efb4d34ab9 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -911,6 +911,21 @@ static int xhci_reset_halted_ep(struct xhci_hcd *xhci, unsigned int slot_id,
return ret;
}
+/*
+ * A Stop Endpoint command is redundant if the EP is not in the Running state.
+ * It will fail with Context State Error. We sometimes queue redundant Stop EP
+ * commands when the EP is held Stopped for Set TR Deq execution, or Halted.
+ * A pending Stop Endpoint command *becomes* redundant if the EP halts before
+ * its completion, and this flag needs to be updated in those cases too.
+ */
+static void xhci_update_stop_cmd_redundant(struct xhci_virt_ep *ep)
+{
+ if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT))
+ ep->ep_state |= EP_STOP_CMD_REDUNDANT;
+ else
+ ep->ep_state &= ~EP_STOP_CMD_REDUNDANT;
+}
+
static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
struct xhci_virt_ep *ep,
struct xhci_td *td,
@@ -946,6 +961,7 @@ static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
return err;
ep->ep_state |= EP_HALTED;
+ xhci_update_stop_cmd_redundant(ep);
xhci_ring_cmd_db(xhci);
@@ -1149,15 +1165,31 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
break;
ep->ep_state &= ~EP_STOP_CMD_PENDING;
return;
+
case EP_STATE_STOPPED:
/*
- * NEC uPD720200 sometimes sets this state and fails with
- * Context Error while continuing to process TRBs.
- * Be conservative and trust EP_CTX_STATE on other chips.
+ * Per xHCI 4.6.9, Stop Endpoint command on a Stopped
+ * EP is a Context State Error, and EP stays Stopped.
+ * The EP could be stopped by some concurrent job, so
+ * ignore this error when that's the case.
+ */
+ if (ep->ep_state & EP_STOP_CMD_REDUNDANT)
+ break;
Can we skip the new flag and just check for the correct flags here directly?
if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT)
break;
Thanks
Mathias