Sorry for the delay in response. I was away.

On Fri, Jun 16, 2023, Jakub Vaněk wrote:
> Hi all,
>
> I've discovered that on recent kernels the xHCI controller on Odroid
> HC2 dies when a USB-attached disk is put under a heavy I/O load.
>
> The hardware in question is using a DWC3 2.00a IP within the Exynos5422

Just want to clarify, this is dwc_usb3 v2.00a and not dwc_usb31.

> to provide two internal USB3 ports. One of them is connected to a
> JMS578 USB-to-SATA bridge (Odroid firmware v173.01.00.02). The bridge
> is then connected to a Intel SSDSC2KG240G8 (firmware XCV10132).
>
> The crash can be triggered by running a read-heavy workload. This
> triggers it for me within tens of seconds:
>
> $ fio --filename=/dev/sda --direct=1 --rw=read --bs=4k \
> --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
> --time_based --group_reporting --name=iops-test-job \
> --eta-newline=1 --readonly
>
> FIO output then follows this pattern:
>
> iops-test-job: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=libaio, iodepth=256
> ...
> fio-3.16
> Starting 4 processes
> Jobs: 4 (f=4): [R(4)][2.5%][r=341MiB/s][r=87.2k IOPS][eta 01m:57s]
> Jobs: 4 (f=4): [R(4)][4.2%][r=340MiB/s][r=87.1k IOPS][eta 01m:55s]
> Jobs: 4 (f=4): [R(4)][5.8%][r=337MiB/s][r=86.2k IOPS][eta 01m:53s]
> Jobs: 4 (f=4): [R(4)][7.5%][r=369MiB/s][r=94.5k IOPS][eta 01m:51s]
> Jobs: 4 (f=4): [R(4)][9.2%][r=364MiB/s][r=93.2k IOPS][eta 01m:49s]
> Jobs: 4 (f=4): [R(4)][10.8%][r=363MiB/s][r=92.9k IOPS][eta 01m:47s]
> Jobs: 4 (f=4): [R(4)][12.5%][r=348MiB/s][r=88.0k IOPS][eta 01m:45s]
> Jobs: 4 (f=4): [R(4)][14.2%][r=348MiB/s][r=88.0k IOPS][eta 01m:43s]
> Jobs: 4 (f=4): [R(4)][15.8%][r=377MiB/s][r=96.4k IOPS][eta 01m:41s]
> Jobs: 4 (f=4): [R(4)][17.5%][r=372MiB/s][r=95.2k IOPS][eta 01m:39s]
> Jobs: 4 (f=4): [R(4)][18.3%][r=77.0MiB/s][r=19.0k IOPS][eta 01m:38s]
> Jobs: 4 (f=4): [R(4)][20.0%][eta 01m:36s]
> < line without progress repeated many times; xHC is now unresponsive >
> Jobs: 4 (f=4): [R(4)][45.8%][eta 01m:05s]
> fio: io_u error on file /dev/sda: No such device: read
> offset=1820839936, buflen=4096
> fio: pid=1863, err=19/file:io_u.c:1787, func=io_u error, error=No such
> device
> < and so on >
>
> Dmesg contains the following output:
>
> [ 266.310767] xhci-hcd xhci-hcd.8.auto: xHCI host controller not
> responding, assume dead
> [ 266.317388] xhci-hcd xhci-hcd.8.auto: HC died; cleaning up
> [ 266.322710] usb 4-1: cmd cmplt err -108
> [ 266.326497] usb 4-1: cmd cmplt err -108
> [ 266.330313] usb 4-1: cmd cmplt err -108
> [ 266.334096] usb 4-1: cmd cmplt err -108
> [ 266.337942] usb 4-1: cmd cmplt err -108
> [ 266.341746] usb 4-1: cmd cmplt err -108
> [ 266.345561] usb 4-1: cmd cmplt err -108
> [ 266.349372] usb 4-1: cmd cmplt err -108
> [ 266.353187] usb 4-1: cmd cmplt err -108
> [ 266.357000] usb 4-1: cmd cmplt err -108
> [ 266.360809] usb 4-1: cmd cmplt err -108
> [ 266.364626] usb 4-1: cmd cmplt err -108
> [ 266.368439] usb 4-1: cmd cmplt err -108
> [ 266.372248] usb 4-1: cmd cmplt err -108
> [ 266.376063] usb 4-1: cmd cmplt err -108
> [ 266.379876] usb 4-1: cmd cmplt err -108
> [ 266.383688] usb 4-1: cmd cmplt err -108
> [ 266.387500] usb 4-1: cmd cmplt err -108
> [ 266.391314] usb 4-1: cmd cmplt err -108
> [ 266.395127] usb 4-1: cmd cmplt err -108
> [ 266.398943] usb 4-1: cmd cmplt err -108
> [ 266.402753] usb 4-1: cmd cmplt err -108
> [ 266.406565] usb 4-1: cmd cmplt err -108
> [ 266.410379] usb 4-1: cmd cmplt err -108
> [ 266.414165] usb 4-1: cmd cmplt err -108
> [ 266.418003] usb 4-1: cmd cmplt err -108
> [ 266.448629] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd
> 1, flush 0, corrupt 0, gen 0
> < more FS errors follow >
>
> The OS is then unable to recover (I have rootfs on that SSD too) and
> the board must be manually restarted.
>
> I can reproduce the problem on mainline 6.4-rc6 with multi_v7_defconfig
> (+ CONFIG_BTRFS=y for the rootfs). I've bisected the problem a while
> back and the first broken commit is b138e23d3dff ("usb: dwc3: core:
> Enable AutoRetry feature in the controller"). Reverting this commit
> locally makes my board stable again (FIO test above can run
> for >10 minutes without any issues).

This info helps a lot.

>
> The crash is happening when the USB-SATA bridge is controlled by the
> uas driver. I have not tested the usb-storage driver yet.
>
> What do you think would be an appropriate fix here? One idea I had is
> to add a Odroid-specific quirk to DWC3 to not enable AutoRetry here.
> However, I'm not entirely sure this is isolated to Odroid boards.
>
> Please let me know if you need me to do some more experiments.
>

This failure indicates that whichever device you're testing against
could not retry with burst (NumP != 0) after a CRC error. After a period
of time, the host timed out and attempted to restore its operations by
stopping the active transfers with a Stop-ep command. However, for some
reason, the host doesn't respond to this command.

The crash you observed is probably a separate issue.
The main issue is why the host doesn't receive a command completion
event. If you're our direct customer, you can submit a STAR request to
our support.

I'm not aware of this type of failure related to AutoRetry. However,
given how old this controller version is (over a decade), I can't be
sure. I think if you test against a different device, you may not
observe the same failure.

To resolve this, please ask our support team to investigate further and
check whether it's a setup issue. Otherwise, we can disable this feature
for dwc_usb3 v2.00a. Depending on how bad the CRC error rate is (which
should be low), this should not affect performance much. I don't think
this necessarily needs a new DT property. Something like this:

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index 0beaab932e7d..1bfd8b127240 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -1209,8 +1209,9 @@ static int dwc3_core_init(struct dwc3 *dwc)
 		dwc3_writel(dwc->regs, DWC3_GUCTL1, reg);
 	}
 
-	if (dwc->dr_mode == USB_DR_MODE_HOST ||
-	    dwc->dr_mode == USB_DR_MODE_OTG) {
+	if (!DWC3_VER_IS(DWC3, 200A) &&
+	    (dwc->dr_mode == USB_DR_MODE_HOST ||
+	     dwc->dr_mode == USB_DR_MODE_OTG)) {
 		reg = dwc3_readl(dwc->regs, DWC3_GUCTL);
 
 		/*

Thanks,
Thinh
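
P.S. To spell out what the condition in the diff above is meant to do, here is a small standalone model of it. This is only a sketch, not driver code: `struct ctrl`, `ver_is()`, `REV_200A`, the enums and `should_enable_auto_retry()` are names made up for illustration, and the revision value is a placeholder rather than the real register encoding. The one assumption carried over from the driver is that the version check matches on both the IP family and the revision, so a dwc_usb31 core is not affected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's types; NOT the kernel definitions. */
enum dr_mode { DR_MODE_UNKNOWN, DR_MODE_HOST, DR_MODE_PERIPHERAL, DR_MODE_OTG };
enum ip_family { IP_DWC_USB3, IP_DWC_USB31 };

/* Placeholder revision code, chosen only for this example. */
#define REV_200A 0x200au

struct ctrl {
	enum ip_family ip;    /* which IP family the core belongs to */
	uint32_t revision;    /* illustrative revision code */
	enum dr_mode dr_mode; /* configured dual-role mode */
};

/* Models a version check that must match both IP family and revision. */
static bool ver_is(const struct ctrl *c, enum ip_family ip, uint32_t rev)
{
	return c->ip == ip && c->revision == rev;
}

/*
 * Mirrors the condition in the diff: skip AutoRetry on dwc_usb3 v2.00a,
 * otherwise enable it only when the controller can act as a host.
 */
bool should_enable_auto_retry(const struct ctrl *c)
{
	if (ver_is(c, IP_DWC_USB3, REV_200A))
		return false;

	return c->dr_mode == DR_MODE_HOST || c->dr_mode == DR_MODE_OTG;
}
```

With this model, a dwc_usb3 v2.00a core in host mode is excluded, while any other revision (or a dwc_usb31 core) in host or OTG mode keeps the feature, and peripheral-only configurations are unchanged.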