Re: Bus noise periodically causes ci_hdrc IRQ lockup

Chandler Griscom <cjgriscom@xxxxxxxxx> · Sat, 23 Feb 2019 16:32:08 -0500

On 2/23/19 3:17 AM, Greg KH wrote:
On Fri, Feb 22, 2019 at 10:43:17AM -0500, Chandler Griscom wrote:
Hello,

I am encountering an issue where noise on USB devices is causing the
host ci_hdrc driver to stall.  The system contains an i.MX6 board
(UDOO) connected to a USB touchscreen, SMSC95xx hub, an FTDI device,
and a hi-speed camera.

Occasionally (after hours or days), or in a noisy environment, all the
devices on the root hub stop working.  They show up in debugfs, lsusb,
etc, but any attempt to communicate with them or reset through
/sys/bus/usb times out with error -110 or -71.
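
For readers unfamiliar with the two codes above, here is a minimal sketch that maps them to their Linux errno names. The numeric values are the generic Linux ones and are hardcoded on purpose; the helper name `describe` and the short descriptions are illustrative, not taken from the report.

```python
# Linux errno values, hardcoded so the sketch behaves the same on any host.
USB_ERRORS = {
    110: ("ETIMEDOUT", "transfer timed out"),
    71: ("EPROTO", "protocol error, e.g. bit-stuff error or no response"),
}


def describe(status):
    """Return a readable label for a negative errno as seen in dmesg."""
    name, meaning = USB_ERRORS.get(
        abs(status), ("unknown", "not covered by this sketch")
    )
    return f"{status}: {name} ({meaning})"


print(describe(-110))
print(describe(-71))
```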

dmesg, ci_hdrc debugfs entries, and lsusb -v are posted here:
https://gist.github.com/cjgriscom/5238df9fbf7ffc4f558b37b5883f8398

Performing a bind/unbind on ci_hdrc with the following commands results in a successful reset:
  # echo "ci_hdrc.0" > /sys/bus/platform/drivers/ci_hdrc/unbind
  # echo "ci_hdrc.0" > /sys/bus/platform/drivers/ci_hdrc/bind
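
The same unbind/bind sequence could be scripted, for example from a watchdog. The following is only a sketch under assumptions: the function name `rebind`, the `settle_s` delay, and the idea of passing the driver directory as a parameter are not from the original report, and on real sysfs it needs root.

```python
import time
from pathlib import Path

# Same sysfs directory as in the two echo commands above.
DRIVER_DIR = Path("/sys/bus/platform/drivers/ci_hdrc")


def rebind(device="ci_hdrc.0", driver_dir=DRIVER_DIR, settle_s=1.0):
    """Unbind a platform device from its driver, then bind it again."""
    driver_dir = Path(driver_dir)
    (driver_dir / "unbind").write_text(device)
    # Hypothetical settle delay; the report does not say one is needed.
    time.sleep(settle_s)
    (driver_dir / "bind").write_text(device)

# Usage on the affected board (as root): rebind()
```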

The issue seems to strongly correlate with a large error count in the
IRQ counter in /sys/kernel/debug/usb/ehci/ci_hdrc.0/registers, whereas
under normal operation the count is very low:
   irq normal 1031800 err 199069 iaa 17040 (lost 0)
After the lockup, interrupts appear to stop firing as the count stops incrementing.
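
As a rough illustration of how the counter line above could be watched, this sketch parses it, computes the error-to-normal ratio, and compares two samples. The helper names are made up for this example, and treating "no counter moved" as a stall is an assumption that only makes sense while the bus is expected to be busy.

```python
import re

# Matches the counter line as quoted from the debugfs registers file.
IRQ_RE = re.compile(r"irq normal (\d+) err (\d+) iaa (\d+) \(lost (\d+)\)")


def parse_irq_line(text):
    """Extract the four interrupt counters from the registers output."""
    m = IRQ_RE.search(text)
    if m is None:
        raise ValueError("no irq counter line found")
    normal, err, iaa, lost = map(int, m.groups())
    return {"normal": normal, "err": err, "iaa": iaa, "lost": lost}


def error_ratio(counts):
    """Error interrupts per normal interrupt (0.0 if none were counted)."""
    return counts["err"] / counts["normal"] if counts["normal"] else 0.0


def looks_stalled(previous, current):
    """True if no counter changed between two samples."""
    return previous == current


sample = parse_irq_line("irq normal 1031800 err 199069 iaa 17040 (lost 0)")
print(f"error ratio: {error_ratio(sample):.3f}")
```

With the numbers quoted above, roughly one error interrupt is counted for every five normal ones.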

I have not yet found a way to reproduce the error outside of the
machine where it occurs.  Swapping hardware has not made a difference.
I have tried artificially inducing bit errors by manipulating the data
lines of one of the attached USB ports, and while this creates a large
number of errors, the bus is able to recover once it returns to normal
operation.  The most reliable way that I have used to reproduce the
failure locally is to run a welder nearby, and the driver usually
fails within minutes.

This sentence is the best thing I have read in a bug report in a very
long time, thank you for it. :)

Yes, noisy electrical things can cause bad problems, and not all
hardware recovers from them equally well.

Peter is the maintainer for this driver; he would know best, as he has
access to the hardware data sheets for this chip and can test things
out.  Maybe he even has access to a good arc welder...

Peter, any ideas?

I have seen the failure occur on the following kernels:
3.14
4.15.7
4.18.20
4.20.6
5.0-rc7

Similar reports:
This old bug report at NXP seems to describe the same issue: https://community.nxp.com/thread/355151
A similar issue seems to have been fixed in the dwc_otg driver: https://github.com/raspberrypi/linux/issues/552

That's interesting, I don't see where that bug was fixed in that issue
report, just that it was "resolved" in a newer update.  Trying to figure
out what the actual commit is might be helpful.

thanks,

greg k-h

I'm glad you found the welder observation amusing; I was quite
surprised myself :)

I dug through the old rpi tree to find P33M's commits; it looks
like it might be these:
https://github.com/raspberrypi/linux/commit/061ccf4d40f3ec9ba76e80d3e672630c53cd776b
https://github.com/raspberrypi/linux/commit/b09a27249d61475e4423607f7632a5aa6e7b3a53
They both appear to be related to missing interrupts in rare cases.

Yesterday I found this repeated message by enabling dynamic debugging
for the chipidea and ehci modules; it might reveal something about
where it gets stuck:

  ci_hdrc ci_hdrc.0: IAA watchdog: status ce088 cmd 10075

That seems to line up with what's found in the debugfs ehci registers:

  status ce088 PPCE Async Periodic Recl FLR
  command 0010075 (park)=0 ithresh=1 IAAD Async Periodic period=512 RUN

Chandler
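
To make the two register values easier to read, here is a small decoding sketch. The bit positions follow the standard EHCI USBSTS/USBCMD layout and the labels mimic the debugfs output quoted above. The function names are made up for this example, and status bit 7 is not among the standard EHCI names used here, so it is left undecoded.

```python
# (bit, label) pairs for USBSTS, in the order debugfs prints them.
STS_BITS = [
    (15, "Async"), (14, "Periodic"), (13, "Recl"), (12, "Halt"),
    (5, "IAA"), (4, "FATAL"), (3, "FLR"), (2, "PCD"), (1, "ERR"), (0, "INT"),
]
PPCE_MASK = 0xFF << 16          # per-port change event bits
FRAME_LIST_SIZES = ("1024", "512", "256", "??")


def decode_status(status):
    """Return the labels of the bits set in a USBSTS value."""
    names = ["PPCE"] if status & PPCE_MASK else []
    names += [name for bit, name in STS_BITS if status & (1 << bit)]
    return names


def decode_command(cmd):
    """Return the main fields of a USBCMD value."""
    return {
        "RUN": bool(cmd & 1),
        "period": FRAME_LIST_SIZES[(cmd >> 2) & 0x3],
        "Periodic": bool(cmd & (1 << 4)),
        "Async": bool(cmd & (1 << 5)),
        "IAAD": bool(cmd & (1 << 6)),
        "park": bool(cmd & (1 << 11)),
        "ithresh": (cmd >> 16) & 0xFF,
    }


def doorbell_pending(status, cmd):
    """IAAD is set in USBCMD but IAA has not been raised in USBSTS."""
    return bool(cmd & (1 << 6)) and not status & (1 << 5)


print(decode_status(0xCE088))
print(decode_command(0x10075))
print(doorbell_pending(0xCE088, 0x10075))
```

Under this reading, the doorbell (IAAD) is set in the command register while the status register shows no IAA, which would be consistent with the watchdog message; that interpretation is an inference from the two values, not something confirmed in the thread.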