Re: Inquiry about the f_tcm: Enhance UASP driver work

Thinh Nguyen <Thinh.Nguyen@xxxxxxxxxxxx> · Sat, 23 Nov 2024 00:02:10 +0000

On Fri, Nov 22, 2024, Michał Pecio wrote:
> Hi,
> 
> > > I tried to use it on dwc3 after fixing some other problems. With the host
> > > side xhci (Ubuntu client) running fio for stress testing, I encountered the
> > > following error on the host:
> > > [18836.092159] xhci_hcd 0000:00:0d.0: Transfer error for slot 3 ep 1 on endpoint
> > > [18836.092211] sd 0:0:0:0: [sda] tag#11 data cmplt err -71 uas-tag 1 inflight: CMD
> > > [18836.092213] sd 0:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 02 5e 31 00 00 01 00 00
> > > .....more and mores....
> > > [18867.369118] scsi host0: uas_eh_device_reset_handler start
> > > [18867.453796] usb 2-3.2: reset SuperSpeed USB device number 4 using xhci_hcd
> > > and the gadget side keeps resetting configfs and printing wait_for_completion
> > > timeout (since dwc3 have )
> > > 
> > > I am not sure whether this is due to the stream exception of dwc3 or some
> > > logic issue in f_tcm and target.
> > 
> > The error is -71. This is a transaction error (could be a CRC error). It
> > could be due to the host, device hardware, electrical interference, or
> > even the cable. It is not a logic issue in software.
> 
> A transaction error is a transaction error, but waiting 30 seconds for
> UAS to reset the device afterwards looks wrong. I seem to recall seeing
> sporadic transaction errors which triggered the reset instantly.

That's not what's happening. I don't recall the storage class driver
handling a transaction error that way. It just waits for the SCSI command
timeout.

> 
> Long delays I have seen mainly on some unfortunate pairings of HC and
> device (HW bugs?) which trigger unusual error conditions poorly handled
> by xhci_hcd. Try with dynamic debug on handle_transferless_tx_event(),
> if your kernel is recent enough for that to be a separate function.

No, this delay is not a HW bug. When there's a transaction error, the xHCI
driver will reset the endpoint. The packet sequence number is reset on the
host side and is now out of sync with the device. The next packet cannot
proceed until there's some sort of recovery. There's no usb_clear_halt()
or port reset immediately after a -EPROTO. The only recovery (a port
reset) happens after a timeout.

> 
> In those cases, UAS seems to wait for other streams to complete before
> resetting, but the whole endpoint is stopped and nothing moves forward.
> At least that's the impression I got, I was looking at other things.
> 
> If you aren't running into this case, I would say something may be wrong
> with UAS implementation on one or the other side.
> 
> It looks like the transaction error was delivered to UAS by means of
> -EPROTO status so xhci_hcd has done its job at least for this one URB.
> No idea what happened later and why the device wasn't reset promptly.
> 

The host doesn't tell the device to reset until after a timeout. There's
no syncing mechanism, so the device wouldn't know how to recover. All it
can tell from the device side is that it's waiting for the transfer to
complete. Perhaps -EPROTO recovery in the storage class driver can be
enhanced in the future.

BR,
Thinh