My transfer ring grew to 740 segments

Michał Pecio <michal.pecio@xxxxxxxxx> · Tue, 11 Mar 2025 23:41:39 +0100

Hi,

This happened under a simple test meant to check whether the AMD
"Promontory" chipset (from ASMedia) has the delayed restart bug (it
does, rarely).

Two pl2303 serial dongles were connected to a hub, and loops were
opening and closing /dev/ttyUSBn to enqueue/dequeue some IN URBs which
would never complete with any data (nothing was fed to UART RX).

The test was running unattended for a few hours and it seems that at
some point the hub stopped working and transfers to downstream devices
were all returning Transaction Error. dmesg was full of this:

[102711.994235] xhci_hcd 0000:02:00.0: Event dma 0x00000000ffef4a50 for ep 6 status 4 not part of TD at 00000000eb22b790 - 00000000eb22b790
[102711.994243] xhci_hcd 0000:02:00.0: Ring seg 0 dma 0x00000000ffef4000
[102711.994246] xhci_hcd 0000:02:00.0: Ring seg 1 dma 0x00000000ffeee000
[102711.994249] xhci_hcd 0000:02:00.0: Ring seg 2 dma 0x00000000ffc4e000

[ ... 735 lines omitted for brevity ... ]

[102711.995935] xhci_hcd 0000:02:00.0: Ring seg 738 dma 0x00000000eb2e2000
[102711.995937] xhci_hcd 0000:02:00.0: Ring seg 739 dma 0x00000000eb22b000

Looking through debugfs, ffef4a50 is indeed a normal TD, apparently no
longer on td_list for some reason and hence the errors. The rest of the
ring is No-Ops.

Class driver enqueues its URBs, rings the doorbell and triggers this
error message. The endpoint halts, but that is ignored. Serial device
is closed, URBs are unlinked, Stop EP sees Halted, resets. No Set Deq
because HW Dequeue doesn't match any known TD. Rinse, repeat.

At some point end of the segment is reached, new segment is allocated
because ep_ring->dequeue is still in the first segment.
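
For illustration only, the growth described above can be reproduced with
a toy model. This is not the driver's code: struct toy_ring,
toy_queue_trbs() and TOY_TRBS_PER_SEG are hypothetical names, and the
255 usable TRBs per segment and the "grow instead of wrapping into the
dequeue segment" rule are simplifying assumptions.

```c
#include <stddef.h>

/* Assumption: usable TRBs per segment, not counting a trailing link TRB. */
#define TOY_TRBS_PER_SEG 255

/* Hypothetical, heavily simplified stand-in for a transfer ring. */
struct toy_ring {
	size_t num_segs; /* segments currently linked into the ring */
	size_t enq_seg;  /* index of the segment holding enqueue */
	size_t enq_trb;  /* enqueue offset inside that segment */
	size_t deq_seg;  /* index of the segment holding dequeue */
};

/*
 * Queue n TRBs. Dequeue is never advanced here, which models the
 * "stuck and forgotten" TD. When enqueue reaches the end of its segment
 * and the next segment is the one holding dequeue, a new segment is
 * inserted after the current one instead of wrapping around.
 */
void toy_queue_trbs(struct toy_ring *r, size_t n)
{
	while (n--) {
		size_t next;

		if (++r->enq_trb < TOY_TRBS_PER_SEG)
			continue;

		/* End of the current segment reached. */
		r->enq_trb = 0;
		next = (r->enq_seg + 1) % r->num_segs;
		if (next == r->deq_seg) {
			/* Grow: link a fresh segment right after enq_seg. */
			r->num_segs++;
			if (r->deq_seg > r->enq_seg)
				r->deq_seg++;
			next = r->enq_seg + 1;
		}
		r->enq_seg = next;
	}
}
```

Starting from two segments with both pointers in segment 0, every
further pass through the last segment adds one more segment, so the
count keeps growing for as long as dequeue never moves.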

So how does the driver enter this screwed up state? Apparently due to
a HW bug. More detailed debug log from a different run:

[39607.305224] xhci_hcd 0000:02:00.0: 2/6 (040/3) ring_ep_doorbell stream 0
[39607.305235] xhci_hcd 0000:02:00.0: 2/6 (040/3) ring_ep_doorbell stream 0
[39607.305413] xhci_hcd 0000:02:00.0: 2/6 (040/1) handle_tx_event comp_code 4 trb_dma 0x00000000ffa80050

The 1 in (040/1) is EP Ctx state, i.e. Running, despite Trans. Error.
It looks like finish_td() sees it, ignores the error and gives back
normally. EP Ctx is still wrong later when the next URB is unlinked:

[39607.398526] xhci_hcd 0000:02:00.0: 2/6 (040/1) xhci_urb_dequeue cancel TD at 0x00000000ffa80060 stream 0
[39607.398531] xhci_hcd 0000:02:00.0: 2/6 (044/1) queue_stop_endpoint suspend 0

But Stop EP fails and updates it properly to 2=Halted:

[39607.398655] xhci_hcd 0000:02:00.0: 2/6 (044/2) handle_cmd_completion cmd_type 15 comp_code 19
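
As a reading aid, the numeric fields in the log lines above can be
decoded with a few lookup helpers. The function names are hypothetical
and only the values that matter here are covered; the numbers themselves
follow the xHCI EP State, completion code and TRB type encodings.

```c
/* Endpoint Context "EP State" field, the second number in (040/1). */
const char *toy_ep_ctx_state_name(unsigned int state)
{
	switch (state) {
	case 0: return "Disabled";
	case 1: return "Running";
	case 2: return "Halted";
	case 3: return "Stopped";
	case 4: return "Error";
	default: return "Reserved";
	}
}

/* Completion codes; only those seen in the logs plus Success. */
const char *toy_comp_code_name(unsigned int code)
{
	switch (code) {
	case 1: return "Success";
	case 4: return "USB Transaction Error";
	case 19: return "Context State Error";
	default: return "(not decoded here)";
	}
}

/* Command TRB types relevant to this sequence. */
const char *toy_cmd_type_name(unsigned int type)
{
	switch (type) {
	case 14: return "Reset Endpoint";
	case 15: return "Stop Endpoint";
	case 16: return "Set TR Dequeue Pointer";
	default: return "(not decoded here)";
	}
}
```

With these, "cmd_type 15 comp_code 19" reads as a Stop Endpoint command
completing with Context State Error, and "comp_code 4" on the transfer
event reads as USB Transaction Error.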

Then the EP is reset without Set Deq or clearing and ffa80050 becomes
"stuck and forgotten", initiating the above problem.

The fact that EP Ctx state is Running for >90ms suggests that it's
a bug. But a race could have a similar effect, and I can't find any
guarantee in the spec that EP Ctx is updated before posting an error
transfer event. 4.8.3 guarantees that it becomes Running before normal
transfer events are posted, but suggests not to trust EP Ctx too much.

Maybe finish_td() should be more cautious?
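
One possible reading of "more cautious", sketched as two standalone
predicates. This is not a patch and not the driver's actual logic: all
names are hypothetical, the "trusting" variant only models the behaviour
described above, and the "cautious" variant assumes that a halting
completion code alone is enough to start recovery, which may be wrong
for hosts that post spurious error events.

```c
#include <stdbool.h>

/* EP Ctx state values used in this sketch. */
enum { TOY_EP_RUNNING = 1, TOY_EP_HALTED = 2 };

/* Completion codes used in this sketch. */
enum {
	TOY_COMP_SUCCESS   = 1,
	TOY_COMP_BABBLE    = 3,
	TOY_COMP_TX_ERR    = 4,
	TOY_COMP_STALL     = 6,
	TOY_COMP_SPLIT_ERR = 36,
};

/* Codes which are expected to leave a non-isochronous endpoint halted. */
bool toy_comp_code_implies_halt(unsigned int comp_code)
{
	switch (comp_code) {
	case TOY_COMP_BABBLE:
	case TOY_COMP_TX_ERR:
	case TOY_COMP_STALL:
	case TOY_COMP_SPLIT_ERR:
		return true;
	default:
		return false;
	}
}

/* Trusting variant: recover only if EP Ctx also reports Halted. */
bool toy_trusting_needs_recovery(unsigned int comp_code,
				 unsigned int ep_ctx_state)
{
	return toy_comp_code_implies_halt(comp_code) &&
	       ep_ctx_state == TOY_EP_HALTED;
}

/* Cautious variant: a halting code is enough, EP Ctx is not consulted. */
bool toy_cautious_needs_recovery(unsigned int comp_code,
				 unsigned int ep_ctx_state)
{
	(void)ep_ctx_state; /* deliberately ignored in this variant */
	return toy_comp_code_implies_halt(comp_code);
}
```

For the event in the log (comp_code 4 while EP Ctx still says Running),
the trusting variant returns false and the cautious one returns true.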

Michal