Re: My transfer ring grew to 740 segments

Michał Pecio <michal.pecio@xxxxxxxxx> · Thu, 13 Mar 2025 08:54:47 +0100




On Wed, 12 Mar 2025 15:37:12 +0200, Mathias Nyman wrote:
> On 12.3.2025 0.41, Michał Pecio wrote:
> > Class driver enqueues its URBs, rings the doorbell and triggers this
> > error message. The endpoint halts, but that is ignored. Serial
> > device is closed, URBs are unlinked, Stop EP sees Halted, resests.
> > No Set Deq because HW Dequeue doesn't match any known TD. Rinse,
> > repeat.  
> 
> Ok, so this means endpoint does get reset, and once restarted it
> tries to transfer the same TRB, which again fails with Transaction
> error.

Yes. It makes me wonder whether it even makes sense to reset endpoints
in cases when the halted TD cannot be identified and skipped with Set
TR Dequeue. We don't know if it got td_to_noop(), and even if it did,
there is no guarantee that the HC flushes TRB cache before retrying.

If connection is lost permanently, this situation at least is safe.
But if it's a temporary Transaction Error or Stall then a future URB
may cause this stale TD to execute, affecting device state without
class driver's knowledge and using a retired data buffer.

Since every known halted TD is cancelled rather than given back, having
a halted EP with no TD to blame appears to generally be a bug. In this
case, finish_td() failed to recognize and handle the halt. And papering
over this problem with a reset didn't make it go away.

> > Maybe finish_td() should be more cautious?  
> 
> Good point, finish_td() should probably trust ep_state flags set by
> driver first.

Actually, finish_td() is supposed to call xhci_handle_halted_endpoint()
which then sets EP_HALTED. It could do so more reliably by trusting the
spec and assuming that every Tr-Error or Babble halts the endpoint
(with exceptions for isochronous and babbling 0.95 control endpoints).

4.8.3 instructs to assume that EP is halted after these errors and
warns that EP Ctx may not always be up to date, although Promontory
seems to (randomly) never update it at all, even 90ms later.

For now, I tried this simple hack and it solved the Promontory problem.
The message gets printed sometimes, but nothing worse happens.

@@ -2254,8 +2254,8 @@ static void finish_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
                                                 td->start_seg, td->start_trb));
                                return;
                        }
-                       /* endpoint not halted, don't reset it */
-                       break;
+                       xhci_info(xhci, "slot %d ep %d comp_code %d but not halted?\n",
+                                       ep->vdev->slot_id, ep->ep_index, trb_comp_code);
                }
                /* Almost same procedure as for STALL_ERROR below */
                xhci_clear_hub_tt_buffer(xhci, td, ep);

BTW, I'm reproducing this bug in a much simpler way, not involving any
dodgy hub. I use a full speed device (a PL2303 serial dongle) and
disconnect its D- line after enumeration. This brakes communications,
but disconnection is not reported because D+ line is still pulled up.