RE: Bus noise periodically causes ci_hdrc IRQ lockup

Peter Chen <peter.chen@xxxxxxx> · Mon, 25 Feb 2019 02:58:15 +0000

> On 2/23/19 3:17 AM, Greg KH wrote:
> > On Fri, Feb 22, 2019 at 10:43:17AM -0500, Chandler Griscom wrote:
> >> Hello,
> >>
> >> I am encountering an issue where noise on USB devices is causing the
> >> host ci_hdrc driver to stall.  The system contains an i.MX6 board
> >> (UDOO) connected to a USB touchscreen, SMSC95xx hub, an FTDI device,
> >> and a hi-speed camera.
> >>
> >> Occasionally (after hours or days), or in a noisy environment, all
> >> the devices on the root hub stop working.  They show up in debugfs,
> >> lsusb, etc, but any attempt to communicate with them or reset through
> >> /sys/bus/usb times out with error -110 or -71.
> >>
> >> dmesg, ci_hdrc debugfs entries, and lsusb -v are posted here:
> >> https://gist.github.com/cjgriscom/5238df9fbf7ffc4f558b37b5883f8398
> >>

I am afraid I can't access the link above; I can only access:

https://github.com/cjgriscom?tab=overview&from=2019-01-01&to=2019-01-31

> >> Performing a bind/unbind on ci_hdrc with the following commands results
> >> in a successful reset:
> >>   # echo "ci_hdrc.0" > /sys/bus/platform/drivers/ci_hdrc/unbind
> >>   # echo "ci_hdrc.0" > /sys/bus/platform/drivers/ci_hdrc/bind
> >>
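For anyone who wants to script the recovery quoted above, here is a minimal Python sketch of the same unbind/bind sequence. The sysfs path and device name are taken from the two echo commands; the function name and the one-second settle delay are assumptions for illustration only, and this is not a fix for the underlying lockup.

```python
import os
import time

# Path taken from the echo commands in the report.
DRIVER_DIR = "/sys/bus/platform/drivers/ci_hdrc"


def rebind(device="ci_hdrc.0", driver_dir=DRIVER_DIR, settle_s=1.0):
    """Unbind and then re-bind a platform device, like the two echo commands."""
    for action in ("unbind", "bind"):
        with open(os.path.join(driver_dir, action), "w") as f:
            f.write(device + "\n")  # echo also appends a newline
        time.sleep(settle_s)  # assumed pause between steps; not from the report
```

It needs root, like the original commands; calling `rebind()` with no arguments targets ci_hdrc.0.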
> >> The issue seems to strongly correlate with a large error count in the
> >> IRQ counter in /sys/kernel/debug/usb/ehci/ci_hdrc.0/registers,
> >> whereas under normal operation the count is very low:
> >>    irq normal 1031800 err 199069 iaa 17040 (lost 0)
> >> After the lockup, interrupts appear to stop firing as the count stops
> >> incrementing.
> >>
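To track those counters over time, a small Python sketch that parses the "irq normal ... err ..." line and flags a stall when two samples are identical. The line format is copied from the quoted output; the function names and the "share of err" metric are assumptions for illustration, not something the driver reports.

```python
import re

# Matches the counter line as quoted from the debugfs registers file.
IRQ_LINE = re.compile(r"irq normal (\d+) err (\d+) iaa (\d+) \(lost (\d+)\)")


def parse_irq_counters(text):
    """Return the four counters as a dict, or None if the line is absent."""
    m = IRQ_LINE.search(text)
    if m is None:
        return None
    normal, err, iaa, lost = (int(g) for g in m.groups())
    return {"normal": normal, "err": err, "iaa": iaa, "lost": lost}


def stalled(prev, curr):
    """True when no counter advanced between two samples."""
    return prev is not None and prev == curr


sample = "irq normal 1031800 err 199069 iaa 17040 (lost 0)"
counters = parse_irq_counters(sample)
# Share of error interrupts among normal + err (illustrative metric only).
err_share = counters["err"] / (counters["normal"] + counters["err"])
```

Sampling the file every few seconds and comparing consecutive results with `stalled()` would show the moment the counts stop incrementing.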
> >> I have not yet found a way to reproduce the error outside of the
> >> machine where it occurs.  Swapping hardware has not made a difference.
> >> I have tried artificially inducing bit errors by manipulating the
> >> data lines of one of the attached USB ports, and while this creates a
> >> large number of errors, the bus is able to recover once it returns to
> >> normal operation.  The most reliable way that I have used to
> >> reproduce the failure locally is to run a welder nearby, and the
> >> driver usually fails within minutes.
> > This sentence is the best thing I have read in a bug report in a very
> > long time, thank you for it. :)
> >
> > Yes, noisy electrical things can cause bad problems, the ability for
> > some hardware to properly recover from those issues is not always the
> > same.
> >
> > Peter is the maintainer for this driver, he would know best as he has
> > access to the hardware data sheets for this chip, and can test things
> > out.  Maybe he even has access to a good arc welder...
> >
> > Peter, any ideas?
> >

I suspect the controller is stuck at high speed. Chandler, would you please supply the information below:
- Are there SOFs on the bus (you need to measure with a probe) when the issue occurs?
- If no SOF can be observed on the bus, it means a disconnection can't be observed either. You could
check the portsc.ccs bit if you can disconnect the hub in some way.

There is no software workaround for now; we still don't know the root cause. Would you please dump the
registers below at usbphy when the issue occurs:

for (write USBPHY.USBPHYx_DEBUG1n($USBPHY + 0x70) from 1 to 7)
	read USBPHY.USBPHYx_DEBUG0_STATUS three times, and delay 10ms between
	each read

Record the above 6 values, and tell me, thanks.
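The loop above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the 0x70 offset for DEBUG1 comes from the pseudocode, while the 0x60 offset for DEBUG0_STATUS, the choice to write the bare value 1..7 to the whole register, and the read32/write32 accessors are assumptions to be checked against the reference manual. How the accessors reach the hardware (for example a /dev/mem mapping of the USBPHY block) is left to the caller.

```python
import time

DEBUG1 = 0x70          # offset given in the pseudocode above
DEBUG0_STATUS = 0x60   # assumed offset; verify against the reference manual


def dump_debug_status(read32, write32, selectors=range(1, 8),
                      reads=3, delay_s=0.010):
    """For each selector, write it to DEBUG1 and read DEBUG0_STATUS `reads` times.

    read32(offset) and write32(offset, value) are caller-supplied accessors
    (hypothetical names) relative to the USBPHY base address.
    """
    results = {}
    for sel in selectors:
        write32(DEBUG1, sel)  # assumption: the bare selector value is written
        values = []
        for i in range(reads):
            values.append(read32(DEBUG0_STATUS))
            if i < reads - 1:
                time.sleep(delay_s)  # 10 ms between reads, as requested
        results[sel] = values
    return results
```

The sketch follows the loop bounds literally, so it returns three readings for each of the seven selector values.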

Peter

> >> I have seen the failure occur on the following kernels:
> >> 3.14
> >> 4.15.7
> >> 4.18.20
> >> 4.20.6
> >> 5.0-rc7
> >>
> >> Similar reports:
> >> This old bug report at NXP seems to describe the same issue:
> >> https://community.nxp.com/thread/355151
> >> A similar issue seems to have been fixed in the dwc_otg driver:
> >> https://github.com/raspberrypi/linux/issues/552
> > That's interesting, I don't see where that bug was fixed in that issue
> > report, just that it was "resolved" in a newer update.  Trying to
> > figure out what the actual commit is might be helpful.
> >
> > thanks,
> >
> > greg k-h
> 
> I'm glad you found the welder observation amusing; I was quite surprised myself :)
> 
> I dug through the old rpi tree to find P33M's commits, it looks like it might be these:
> https://github.com/raspberrypi/linux/commit/061ccf4d40f3ec9ba76e80d3e672630c53cd776b
> https://github.com/raspberrypi/linux/commit/b09a27249d61475e4423607f7632a5aa6e7b3a53
> They both appear to be related to missing interrupts in rare cases.
> 
> Yesterday I found this repeated message by enabling dynamic debugging for the
> chipidea and ehci modules; it might reveal something about where it gets stuck:
> 
>   ci_hdrc ci_hdrc.0: IAA watchdog: status ce088 cmd 10075
>
> That seems to line up with what's found in debugfs ehci registers:
>
>   status ce088 PPCE Async Periodic Recl FLR
>   command 0010075 (park)=0 ithresh=1 IAAD Async Periodic period=512 RUN
>
> Chandler
>
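The two hex words in the watchdog message can be decoded by hand. The Python sketch below does so using the standard EHCI USBSTS/USBCMD bit positions as I understand them; the PPCE mask follows my reading of the Linux ehci.h definition, and the function names are made up for illustration. Bits the tables do not name (such as 0x80 in the status word) are simply ignored.

```python
# EHCI USBSTS flags, highest bit first (same order as the debugfs text).
STS_FLAGS = [
    (1 << 15, "Async"),
    (1 << 14, "Periodic"),
    (1 << 13, "Recl"),
    (1 << 12, "Halt"),
    (1 << 5, "IAA"),
    (1 << 4, "FATAL"),
    (1 << 3, "FLR"),
    (1 << 2, "PCD"),
    (1 << 1, "ERR"),
    (1 << 0, "INT"),
]
STS_PPCE_MASK = 0xFF << 16  # per-port change events (assumed from ehci.h)
CMD_IAAD = 1 << 6           # interrupt-on-async-advance doorbell
STS_IAA = 1 << 5            # interrupt on async advance


def decode_status(status):
    """Return the names of the USBSTS flags set in `status`."""
    names = ["PPCE"] if status & STS_PPCE_MASK else []
    names += [name for bit, name in STS_FLAGS if status & bit]
    return names


def decode_command(cmd):
    """Return the main USBCMD fields as a dict."""
    flags = []
    if cmd & CMD_IAAD:
        flags.append("IAAD")
    if cmd & (1 << 5):
        flags.append("Async")
    if cmd & (1 << 4):
        flags.append("Periodic")
    # Frame list size field (bits 3:2): 0 -> 1024, 1 -> 512, 2 -> 256 entries.
    period = {0: 1024, 1: 512, 2: 256}.get((cmd >> 2) & 0x3)
    return {"ithresh": (cmd >> 16) & 0xFF, "flags": flags,
            "period": period, "run": bool(cmd & 1)}


def iaa_pending(status, cmd):
    """True when the doorbell is set but the IAA status bit is not."""
    return bool(cmd & CMD_IAAD) and not (status & STS_IAA)
```

For the values quoted here, `decode_status(0xce088)` gives PPCE, Async, Periodic, Recl and FLR, matching the debugfs text, and `iaa_pending(0xce088, 0x10075)` is true: the doorbell is set in the command word while the IAA bit is absent from the status word.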