Re: Inquiry about the f_tcm: Enhance UASP driver work

Michał Pecio <michal.pecio@xxxxxxxxx> · Sat, 23 Nov 2024 15:25:12 +0100

On Sat, 23 Nov 2024 00:02:10 +0000, Thinh Nguyen wrote:
> > Long delays I have seen mainly on some unfortunate pairings of HC
> > and device (HW bugs?) which trigger unusual error conditions poorly
> > handled by xhci_hcd. Try with dynamic debug on
> > handle_transferless_tx_event(), if your kernel is recent enough for
> > that to be a separate function.  
> 
> No, this delay is not a HW bug. When there's transaction error, the
> xHCI driver will reset the endpoint. The packet sequence number is
> reset and out of sync with the device. The next packet cannot proceed
> until there's some sort of recovery. There's no usb_clear_halt() or
> port reset immediately after a -EPROTO. The only recovery (port
> reset) will happen is after a timeout.

I think you are right. I tried to repro and I got this:

[Nov23 14:01] xhci-pci-renesas 0000:03:00.0: Transfer error for slot 1 ep 6 on endpoint
[  +0.000380] xhci-pci-renesas 0000:03:00.0: Transfer error for slot 1 ep 6 on endpoint
[ +30.096820] sd 6:0:0:0: [sdb] tag#1 uas_eh_abort_handler 0 uas-tag 2 inflight: IN 
[  +0.000006] sd 6:0:0:0: [sdb] tag#1 CDB: opcode=0x28 28 00 02 d0 30 08 00 02 00 00
[  +0.012009] scsi host6: uas_eh_device_reset_handler start
[  +0.114634] usb 13-2: reset SuperSpeed USB device number 6 using xhci-pci-renesas
[  +0.017603] scsi host6: uas_eh_device_reset_handler success
[  +0.000072] sd 6:0:0:0: [sdb] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=30s
[  +0.000003] sd 6:0:0:0: [sdb] tag#1 CDB: opcode=0x28 28 00 02 d0 30 08 00 02 00 00
[  +0.000001] I/O error, dev sdb, sector 47198216 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0

I will keep it running for a few more hours, and if those timeouts
keep happening, I will have to conclude that I remembered wrong.

> > before resetting, but the whole endpoint is stopped and nothing
> > moves forward. At least that's the impression I got, I was looking
> > at other things.

But a completely stopped endpoint is *also* possible if you encounter
COMP_INVALID_STREAM_ID. I see it after some command errors on this chip:

13fd:5910 Initio Corporation

> Perhaps this can be enhanced in the future in the storage class
> driver regarding -EPROTO recovery.

It's a universal problem with xhci_hcd: it resets the host sequence
state on every error, which is against Linux convention, so nobody
expects it and nobody handles it. It's nuts.

One thing I'm going to try is to patch it to stop doing this and see
what happens.
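
To illustrate the desync described above, here is a toy model in plain C.
It is only a sketch, not xhci_hcd code: the struct and the toy_* helpers
are made up for illustration, and the model assumes a device that simply
refuses any packet whose sequence number is not the one it expects.

```c
#include <stdbool.h>

#define SEQ_MODULO 32  /* SuperSpeed data packets carry a 5-bit sequence number */

/* Toy model of one bulk endpoint: each side tracks the number it expects next. */
struct toy_endpoint {
    unsigned host_seq;  /* sequence number the host will send next */
    unsigned dev_seq;   /* sequence number the device expects next */
};

/* Host sends one data packet; the toy device accepts it only on a match. */
static bool toy_send_packet(struct toy_endpoint *ep)
{
    if (ep->host_seq != ep->dev_seq)
        return false;  /* out of sync: nothing moves forward */
    ep->host_seq = (ep->host_seq + 1) % SEQ_MODULO;
    ep->dev_seq = (ep->dev_seq + 1) % SEQ_MODULO;
    return true;
}

/* Host-only reset, as after a transaction error: the device is not told. */
static void toy_host_reset_endpoint(struct toy_endpoint *ep)
{
    ep->host_seq = 0;
}

/* Stand-in for a clear-halt or port reset: both sides start over. */
static void toy_full_recovery(struct toy_endpoint *ep)
{
    ep->host_seq = 0;
    ep->dev_seq = 0;
}
```

In this model, after a few successful packets a call to
toy_host_reset_endpoint() makes every later toy_send_packet() fail until
toy_full_recovery() runs, which is the stall that in the log above only
ends when the 30 s SCSI timeout triggers a device reset.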

Regards,
Michal