Strange issues with UAS device - repro

Michał Pecio <michal.pecio@xxxxxxxxx> · Wed, 13 Nov 2024 00:26:58 +0100

Context:
https://lore.kernel.org/linux-usb/Ztn2ZsPtxmCTiR78@xxxxxxxxxxx/T/

Hi Mathias,

I found the holy grail - a simple way to trick the uas driver into
unlinking URBs and bringing death and destruction.

That driver almost never unlinks URBs, which is why streams unlinking
remained broken for years and no one cared, until that long and angry
patch which fixed it this summer.

The long and angry commit message suggested one reproducer: reading
a disk with bad sectors. Unfortunately, I don't keep such hardware
around, and the two healthy disks I tried don't implement the commands
which would make them pretend that they are broken.

But the "strange issues" story shows a different way: certain failures
of SMART commands. I don't know the details, but this is enough for me:
an Intel 320 series SSD triggers it on 'smartctl -x'. And it looks like
the "strange issues" reporter's disks did something similar.

So a suitable disk must be found, and also a suitable UAS enclosure,
because the first one I tried is buggy (or runs into a bug in uas?),
which causes an "Invalid Stream ID" error and the end of the fun. So I
bought a new one with the cheap and common JMS578 bridge and it works
fine.

With this, it becomes trivial to reproduce the bug which I fixed with
"Fix TD invalidation under pending Set TR Dequeue":

[ 9264.504123] xhci-pci-renesas 0000:03:00.0: Set TR Deq already pending, don't submit for 0x0x000000017323fe00
[ 9264.504125] xhci-pci-renesas 0000:03:00.0: Failed to clear cancelled cached URB ffff88816a293cc0, mark clear anyway
[ 9264.504127] xhci-pci-renesas 0000:03:00.0: Failed to clear cancelled cached URB ffff88812361c840, mark clear anyway
[ 9264.504128] xhci-pci-renesas 0000:03:00.0: Failed to clear cancelled cached URB ffff8881002e43c0, mark clear anyway

All it takes is running a few 'smartctl -x' in parallel, in a loop.
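
A minimal sketch of such a loop, in case someone wants to try it on
their own hardware. The function names, the default of 4 parallel
workers, the round count, and the /dev/sdX device path are my own
placeholders, not values taken from the actual test run. Nonzero exit
codes from smartctl are expected here, since failing SMART commands are
the whole point.

```python
import subprocess


def run_round(cmd, workers):
    """Start `workers` copies of `cmd` at once and wait for all of them.

    Returns the exit codes in start order.
    """
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
        for _ in range(workers)
    ]
    return [p.wait() for p in procs]


def run_loop(device, workers=4, rounds=1000):
    """Repeat parallel 'smartctl -x DEVICE' rounds (defaults are assumed)."""
    cmd = ["smartctl", "-x", device]
    for i in range(rounds):
        codes = run_round(cmd, workers)
        print(f"round {i}: exit codes {codes}")


# Example (hypothetical device path, needs root and smartmontools):
#   run_loop("/dev/sdX")
```

Watch dmesg in another terminal while this runs.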

And I was curious to see those TRB Errors on Set TR Deq on ASM3142.
This is triggered (rarely) by 'smartctl -x' while reading the disk:

[ 4541.290234] xhci_hcd 0000:02:00.0: 5/6 (010/3) ring_ep_doorbell
[ 4541.290237] xhci_hcd 0000:02:00.0: 5/6 (010/3) ring_ep_doorbell
[ 4541.290281] xhci_hcd 0000:02:00.0: 5/6 (010/3) ring_ep_doorbell
[ 4541.380361] xhci_hcd 0000:02:00.0: 5/6 (010/1) urb_dequeue urb ffff88816a268a80 td-dma 0x000000000341ba50 stream 3
[ 4541.380368] xhci_hcd 0000:02:00.0: 5/6 (014/1) queue_stop_endpoint suspend 0
[ 4541.443637] xhci_hcd 0000:02:00.0: 5/6 (014/1) handle_tx_event comp_code 26 trb_dma 0x00000000034da400
[ 4541.443642] xhci_hcd 0000:02:00.0: Transfer event 26 for unknown stream ring slot 5 ep 6
[ 4541.443711] xhci_hcd 0000:02:00.0: 5/6 (014/3) handle_cmd_completion cmd_type 15 comp_code 1
[ 4541.443715] xhci_hcd 0000:02:00.0: 5/6 (014/3) queue_set_tr_deq stream 3 addr 0x0x000000000341ba60
[ 4541.443769] xhci_hcd 0000:02:00.0: 5/6 (011/3) handle_cmd_completion cmd_type 16 comp_code 5
[ 4541.443772] xhci_hcd 0000:02:00.0: WARN Set TR Deq Ptr cmd invalid because of stream ID configuration
[ 4541.443774] xhci_hcd 0000:02:00.0: 5/6 (010/3) ring_ep_doorbell
[ 4541.443777] xhci_hcd 0000:02:00.0: 5/6 (010/3) ring_ep_doorbell
[ 4541.443808] xhci_hcd 0000:02:00.0: 5/6 (010/3) ring_ep_doorbell

An interesting detail is that this bug is sometimes (seen twice so far)
preceded by a Stopped event which doesn't match any known stream ring.
This might possibly be our bug, or it's just the HW being broken.
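
To count how often the two events go together, a saved log can be
scanned with something like the sketch below. It assumes the log
contains the extra debug lines shown above (the handle_cmd_completion
lines come from my debug prints, not from a stock kernel), and the
function name is made up for illustration. For simplicity it does not
filter by slot or endpoint. The numeric codes are the xHCI ones:
command type 16 is Set TR Dequeue Pointer, completion code 5 is TRB
Error, and completion code 26 is Stopped.

```python
import re

# Patterns for the log lines quoted above.
CMD_RE = re.compile(r"handle_cmd_completion cmd_type (\d+) comp_code (\d+)")
UNKNOWN_RE = re.compile(r"Transfer event (\d+) for unknown stream ring")

TRB_SET_DEQ = 16     # Set TR Dequeue Pointer command
COMP_TRB_ERROR = 5   # TRB Error completion code
COMP_STOPPED = 26    # Stopped completion code


def find_set_deq_errors(lines):
    """Return (line_index, preceded_by_unknown_stopped) for each failed
    Set TR Deq command completion found in `lines`."""
    results = []
    unknown_stopped = False
    for i, line in enumerate(lines):
        m = UNKNOWN_RE.search(line)
        if m and int(m.group(1)) == COMP_STOPPED:
            unknown_stopped = True
            continue
        m = CMD_RE.search(line)
        if not m:
            continue
        cmd_type, comp_code = int(m.group(1)), int(m.group(2))
        if cmd_type == TRB_SET_DEQ:
            if comp_code == COMP_TRB_ERROR:
                results.append((i, unknown_stopped))
            # Each Set TR Deq completion ends one cancellation cycle.
            unknown_stopped = False
    return results


# Example: find_set_deq_errors(open("dmesg.txt").read().splitlines())
```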

Otherwise, there is nothing remarkable here. I hoped that it might be
triggered by the start/stop race, but it doesn't seem so: there is a
fairly long time between endpoint restarts, and yet it still happens.

This is all I have for now. I'm leaving this repro running on a Renesas
controller to see if anything pops up. I feel it's a HW bug in ASM3142.

Regards,
Michal