Hi Thinh,

thank you for your reply!

On Thu, 2023-06-22 at 22:33 +0000, Thinh Nguyen wrote:
> Sorry for the delay in response. I was away.
>
> On Fri, Jun 16, 2023, Jakub Vaněk wrote:
> > Hi all,
> >
> > I've discovered that on recent kernels the xHCI controller on Odroid
> > HC2 dies when a USB-attached disk is put under a heavy I/O load.
> >
> > The hardware in question is using a DWC3 2.00a IP within the Exynos5422
>
> Just want to clarify, this is dwc_usb3 v2.00a and not dwc_usb31.

Indeed, I forgot to add this.

> > to provide two internal USB3 ports. One of them is connected to a
> > JMS578 USB-to-SATA bridge (Odroid firmware v173.01.00.02). The bridge
> > is then connected to an Intel SSDSC2KG240G8 (firmware XCV10132).
> >
> > The crash can be triggered by running a read-heavy workload. This
> > triggers it for me within tens of seconds:
> >
> > $ fio --filename=/dev/sda --direct=1 --rw=read --bs=4k \
> >     --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
> >     --time_based --group_reporting --name=iops-test-job \
> >     --eta-newline=1 --readonly
> >
> > FIO output then follows this pattern:
> >
> > iops-test-job: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=libaio, iodepth=256
> > ...
> > fio-3.16
> > Starting 4 processes
> > Jobs: 4 (f=4): [R(4)][2.5%][r=341MiB/s][r=87.2k IOPS][eta 01m:57s]
> > Jobs: 4 (f=4): [R(4)][4.2%][r=340MiB/s][r=87.1k IOPS][eta 01m:55s]
> > Jobs: 4 (f=4): [R(4)][5.8%][r=337MiB/s][r=86.2k IOPS][eta 01m:53s]
> > Jobs: 4 (f=4): [R(4)][7.5%][r=369MiB/s][r=94.5k IOPS][eta 01m:51s]
> > Jobs: 4 (f=4): [R(4)][9.2%][r=364MiB/s][r=93.2k IOPS][eta 01m:49s]
> > Jobs: 4 (f=4): [R(4)][10.8%][r=363MiB/s][r=92.9k IOPS][eta 01m:47s]
> > Jobs: 4 (f=4): [R(4)][12.5%][r=348MiB/s][r=88.0k IOPS][eta 01m:45s]
> > Jobs: 4 (f=4): [R(4)][14.2%][r=348MiB/s][r=88.0k IOPS][eta 01m:43s]
> > Jobs: 4 (f=4): [R(4)][15.8%][r=377MiB/s][r=96.4k IOPS][eta 01m:41s]
> > Jobs: 4 (f=4): [R(4)][17.5%][r=372MiB/s][r=95.2k IOPS][eta 01m:39s]
> > Jobs: 4 (f=4): [R(4)][18.3%][r=77.0MiB/s][r=19.0k IOPS][eta 01m:38s]
> > Jobs: 4 (f=4): [R(4)][20.0%][eta 01m:36s]
> > < line without progress repeated many times; xHC is now unresponsive >
> > Jobs: 4 (f=4): [R(4)][45.8%][eta 01m:05s]
> > fio: io_u error on file /dev/sda: No such device: read
> > offset=1820839936, buflen=4096
> > fio: pid=1863, err=19/file:io_u.c:1787, func=io_u error, error=No such
> > device
> > < and so on >
> >
> > Dmesg contains the following output:
> >
> > [ 266.310767] xhci-hcd xhci-hcd.8.auto: xHCI host controller not
> > responding, assume dead
> > [ 266.317388] xhci-hcd xhci-hcd.8.auto: HC died; cleaning up
> > [ 266.322710] usb 4-1: cmd cmplt err -108
> > [ 266.326497] usb 4-1: cmd cmplt err -108
> > [ 266.330313] usb 4-1: cmd cmplt err -108
> > [ 266.334096] usb 4-1: cmd cmplt err -108
> > [ 266.337942] usb 4-1: cmd cmplt err -108
> > [ 266.341746] usb 4-1: cmd cmplt err -108
> > [ 266.345561] usb 4-1: cmd cmplt err -108
> > [ 266.349372] usb 4-1: cmd cmplt err -108
> > [ 266.353187] usb 4-1: cmd cmplt err -108
> > [ 266.357000] usb 4-1: cmd cmplt err -108
> > [ 266.360809] usb 4-1: cmd cmplt err -108
> > [ 266.364626] usb 4-1: cmd cmplt err -108
> > [ 266.368439] usb 4-1: cmd cmplt err -108
> > [ 266.372248] usb 4-1: cmd cmplt err -108
> > [ 266.376063] usb 4-1: cmd cmplt err -108
> > [ 266.379876] usb 4-1: cmd cmplt err -108
> > [ 266.383688] usb 4-1: cmd cmplt err -108
> > [ 266.387500] usb 4-1: cmd cmplt err -108
> > [ 266.391314] usb 4-1: cmd cmplt err -108
> > [ 266.395127] usb 4-1: cmd cmplt err -108
> > [ 266.398943] usb 4-1: cmd cmplt err -108
> > [ 266.402753] usb 4-1: cmd cmplt err -108
> > [ 266.406565] usb 4-1: cmd cmplt err -108
> > [ 266.410379] usb 4-1: cmd cmplt err -108
> > [ 266.414165] usb 4-1: cmd cmplt err -108
> > [ 266.418003] usb 4-1: cmd cmplt err -108
> > [ 266.448629] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd
> > 1, flush 0, corrupt 0, gen 0
> > < more FS errors follow >
> >
> > The OS is then unable to recover (I have rootfs on that SSD too) and
> > the board must be manually restarted.
> >
> > I can reproduce the problem on mainline 6.4-rc6 with multi_v7_defconfig
> > (+ CONFIG_BTRFS=y for the rootfs). I've bisected the problem a while
> > back and the first broken commit is b138e23d3dff ("usb: dwc3: core:
> > Enable AutoRetry feature in the controller"). Reverting this commit
> > locally makes my board stable again (FIO test above can run
> > for >10 minutes without any issues).
>
> This info helps a lot.
>
> >
> > The crash is happening when the USB-SATA bridge is controlled by the
> > uas driver. I have not tested the usb-storage driver yet.
> >
> > What do you think would be an appropriate fix here? One idea I had is
> > to add an Odroid-specific quirk to DWC3 to not enable AutoRetry here.
> > However, I'm not entirely sure this is isolated to Odroid boards.
> >
> > Please let me know if you need me to do some more experiments.
> >
>
> This failure indicates that whichever device you're testing against
> could not retry with burst (NumP != 0) after a CRC error. After a period
> of time, the host timed out and attempted to restore its operations by
> stopping the active transfers with a Stop-ep command.
> However, for some reason, the host doesn't respond to this command.
> The crash you observed is probably a separate issue. The main issue is
> why the host doesn't receive a command completion event. If you're our
> direct customer, you can submit a STAR request for our support. I'm not
> aware of this type of failure related to AutoRetry. However, given how
> old this controller version is (over a decade old), I can't be sure.

Thank you, this explanation makes sense to me.

> I think if you try to test against a different device, you may not
> observe this same failure.

I can partially confirm this. There is a USB3 to 1Gbit Ethernet bridge
onboard too and this peripheral appears to work reliably. I am unable
to test a different USB-to-SATA bridge though - there are no physical
USB3 ports on Odroid HC2. It would be possible to verify this on Odroid
XU4, which uses the same chip and does have physical USB ports. However,
I don't have one at hand now.

> To resolve this, please ask our support team to investigate
> further to see whether it's a setup issue. Otherwise, we can disable
> this feature for dwc_usb3 v2.00a. Depending on how bad the CRC error
> rate is (which should be low), this should not affect performance
> much.

I unfortunately have no relationship with any of Synopsys, Samsung or
Hardkernel. Would you be OK with me submitting the proposed patch even
without further investigation? Also, can I submit this for backporting
to -stable?

> I don't think this necessarily needs a new DT property.
>
> Something like this:
>
> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> index 0beaab932e7d..1bfd8b127240 100644
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -1209,8 +1209,9 @@ static int dwc3_core_init(struct dwc3 *dwc)
>  		dwc3_writel(dwc->regs, DWC3_GUCTL1, reg);
>  	}
>
> -	if (dwc->dr_mode == USB_DR_MODE_HOST ||
> -	    dwc->dr_mode == USB_DR_MODE_OTG) {
> +	if (!DWC3_VER_IS(DWC3, 200A) &&
> +	    (dwc->dr_mode == USB_DR_MODE_HOST ||
> +	     dwc->dr_mode == USB_DR_MODE_OTG)) {
>  		reg = dwc3_readl(dwc->regs, DWC3_GUCTL);
>
>  		/*
>
> Thanks,
> Thinh

Thank you,
Jakub
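
For anyone reading along without the dwc3 sources at hand, the condition
in the diff above can be sketched as a standalone C function. This is
only an illustration under stated assumptions: the sketch_* types, the
revision constant and the function names are made up for this example
and are not the driver's real definitions. Only the logic (skip
AutoRetry on dwc_usb3 v2.00a, otherwise enable it in host or OTG mode)
is taken from the proposed patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's IP family and dr_mode values. */
enum sketch_ip { SKETCH_IP_DWC3, SKETCH_IP_DWC31 };
enum sketch_dr_mode {
	SKETCH_MODE_HOST,
	SKETCH_MODE_PERIPHERAL,
	SKETCH_MODE_OTG,
};

/* Illustrative revision code for v2.00a; not the real register value. */
#define SKETCH_REV_200A 0x200au

/* Minimal controller description used only by this sketch. */
struct sketch_ctrl {
	enum sketch_ip ip;
	unsigned int revision;
	enum sketch_dr_mode dr_mode;
};

/* True only for the dwc_usb3 (not dwc_usb31) IP at revision 2.00a. */
static bool sketch_is_dwc3_200a(const struct sketch_ctrl *c)
{
	return c->ip == SKETCH_IP_DWC3 && c->revision == SKETCH_REV_200A;
}

/*
 * Mirrors the condition from the proposed diff: AutoRetry is left
 * disabled on dwc_usb3 v2.00a, and otherwise enabled only when the
 * controller can act as a host (host or OTG mode).
 */
bool sketch_should_enable_autoretry(const struct sketch_ctrl *c)
{
	if (sketch_is_dwc3_200a(c))
		return false;

	return c->dr_mode == SKETCH_MODE_HOST ||
	       c->dr_mode == SKETCH_MODE_OTG;
}
```

With this model, a dwc_usb3 v2.00a controller in host mode (the
Exynos5422 case described in this thread) would not get AutoRetry, while
other revisions keep the current behaviour.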