RE: Bus noise periodically causes ci_hdrc IRQ lockup

Peter Chen <peter.chen@xxxxxxx> · Fri, 1 Mar 2019 09:39:56 +0000

> On 2/28/19 1:57 AM, Peter Chen wrote:
> >
> >>> Let me summary your observation:
> >>> - bind/unbind ci_hdrc device can recover connection
> >>> - Reset HUB can't recover, and will go the previous error state
> >>> after reset
> >>>
> >>>   From the register, we do see something abnormal, and the RX is
> >>> waiting the SYNC Field. We need to see the dp/dm status to know if
> >>> HUB is wrong, eg, sending data exceed 20us (larger than 1024 bytes)
> >>>
> >>>> I will continue looking into probing Dm/Dp.  You would like me to
> >>>> do this
> >>>> *while* the failure occurs, or after?
> >>>>
> >>> After the error occurs.
> >>>
> >>> Peter
> >>>
> >> Hi Peter,
> >>
> >> That summary is accurate.
> >>
> >> I soldered some leads onto the host/hub connection and hooked up to
> >> oscilloscope.
> >> The findings were interesting:
> >>
> >> Directly after failure:
> >>     0x020CA060 @ 0x2: 0x00000064
> >>     PORTSC reg: 18001a05
> >>     Dm line: 150 mV
> >>     Dp line: 0 V
> >> After failure, hub reset asserted:
> >>     0x020CA060 @ 0x2: 0x00000024
> >>     PORTSC reg: 18001205
> >>     Dm line: 0 V
> >>     Dp line: 0 V
> >> After failure, hub reset released:
> >>     0x020CA060 @ 0x2: 0x00000064
> >>     PORTSC reg: 18001a05
> >>     Dm line: 150 mV
> >>     Dp line: 0 V
> >>
> >> It seems strange that it would switch between those voltages -- could
> >> the hub and host be trying to write different values at the same time?
> >>
> > HUB and host are impossible to send the data together.
> >
> >> I have noticed something new happening (maybe as a result of hooking
> >> up the probe?).
> >> A couple times now after initial failure, the device has changed states later.
> >> In this state the Linux USB devices appear to 'wake up' and start throwing errors.
> >>
> >>      (failure occurs)
> >> [  227.323636] smsc95xx 1-1.4.1:1.0 eth1: Failed to read reg index 0x00000114: -110
> >> [  227.323659] smsc95xx 1-1.4.1:1.0 eth1: Error reading MII_ACCESS
> >> [  227.323677] smsc95xx 1-1.4.1:1.0 eth1: MII is busy in smsc95xx_mdio_read
> >> [  227.323694] smsc95xx 1-1.4.1:1.0 eth1: Failed to read MII_BMSR
> >>      (no errors for 25 minutes, then something changes)
> >> [ 1752.092896] uvcvideo: Non-zero status (-71) in video completion handler.
> >> [ 1752.124744] uvcvideo: Non-zero status (-71) in video completion handler.
> >> [ 1752.124866] usb 1-1.4: clear tt 3 (91c1) error -71
> >>      ...lots of errors...
> >>
> >> Registers:
> >> 0x020CA060 @ 0x1: 0x0000FFFF
> >> 0x020CA060 @ 0x2: 0x00140060
> >> 0x020CA060 @ 0x3: 0x10801110
> >> 0x020CA060 @ 0x4: 0x00010001
> >> 0x020CA060 @ 0x5: 0x01011101
> >> 0x020CA060 @ 0x6: 0x00000101
> >> 0x020CA060 @ 0x7: 0x06200010 (changing)
> >> 0x020CA060 @ 0x8: 0x11000001
> >>     PORTSC reg: steady at 10001801
> >>     Dm line: steady at 3 V
> >>     Dp line: steady at 0 V
> >>
> >> And after that, when I reset the hub it returns to normal operation.
> >>
> >>     ...lots of errors...
> >> [ 1977.285844] usb 1-1.4: clear tt 3 (91c1) error -71
> >>      (hub reset asserted)
> >> [ 1977.309718] usb 1-1: USB disconnect, device number 2
> >>      (hub reset released)
> >> [ 2088.226453] usb 1-1: new high-speed USB device number 29 using ci_hdrc
> >>
> >> Let me know what you think of this,
> >>
> > It seems you record DM/DP opposite, please confirm it.
> I checked and yes, they were opposite.
> > Besides,
> > - Do you observe this USB issue at specific board or some boards?
> 
> I have observed it in two different boards in this specific setup, and in two other
> machines in the field (not in my possession).
> 
> > - After connecting probe, sometimes the reset HUB can recover, and
> > somethings can't?
> It looks like attaching the probe makes the data lines much more sensitive to
> interference.  When I run the welder again after the initial failure, the state changes
> as shown in the 2nd earlier log. I have reproduced this again, see below link - log #5.
> > - When HUB's reset is asserted, does the register dump and measure are like
> > below (can't recover situation):
> > 0x020CA060 @ 0x1: 0x00007B2C
> > 0x020CA060 @ 0x2: 0x00000024
> > 0x020CA060 @ 0x3: 0x108401C0 (still changes on every read)
> > 0x020CA060 @ 0x4: 0x00010001
> > 0x020CA060 @ 0x5: 0x01011101
> > 0x020CA060 @ 0x6: 0x00000101
> > 0x020CA060 @ 0x7: 0x05300010
> > 0x020CA060 @ 0x8: 0x81000001
> > PORTSC reg: 18001205
> > Dm line: 0 V
> > Dp line: 0 V
> That's correct.
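
For reference, the PORTSC values quoted in this thread can be split into
fields. This is only a minimal sketch, assuming the usual ChipIdea/EHCI-style
PORTSC layout (CCS bit 0, PE bit 2, SUSP bit 7, PR bit 8, HSP bit 9,
LS bits 11:10, PP bit 12, PSPD bits 27:26); the helper name decode_portsc is
made up for illustration, and the layout should be checked against the
reference manual of the actual SoC.

```python
# Assumed encodings for the two-bit fields (EHCI-style line status,
# ChipIdea-style port speed). Verify against the SoC reference manual.
LINE_STATE = {0b00: "SE0", 0b01: "K-state", 0b10: "J-state", 0b11: "undefined"}
PORT_SPEED = {0b00: "full", 0b01: "low", 0b10: "high", 0b11: "undefined"}


def decode_portsc(value):
    """Split a raw PORTSC value into a few named fields (assumed layout)."""
    return {
        "CCS": bool(value & (1 << 0)),    # current connect status
        "PE": bool(value & (1 << 2)),     # port enabled
        "SUSP": bool(value & (1 << 7)),   # port suspended
        "PR": bool(value & (1 << 8)),     # port reset in progress
        "HSP": bool(value & (1 << 9)),    # high-speed port
        "LS": LINE_STATE[(value >> 10) & 0x3],
        "PP": bool(value & (1 << 12)),    # port power
        "PSPD": PORT_SPEED[(value >> 26) & 0x3],
    }


if __name__ == "__main__":
    # The three PORTSC values reported in this thread.
    for raw in (0x18001A05, 0x18001205, 0x10001801):
        print(f"{raw:08x}: {decode_portsc(raw)}")
```

Under that assumed layout, 18001a05 decodes as connected, enabled, high speed
with line state J; 18001205 is the same but with line state SE0; and 10001801
is connected, not enabled, full speed with line state J.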
> > Besides, I need your whole kernel log with and without using probe,
> > your original kernel log at github is ok (but need to let me access).
> > I need to know if bus reset and bus suspend occur during the whole process.
> >
> > My guesses are:
> > First log: the HUB enters FS with unknown reason, it adds its 1.5 Kohm
> > (minimum can be 900ohm) @3.3V, and the host is 45ohm, so the host sees
> > it is ~150mV. The host controller is stuck at HS rxactive state at this situation
> > forever.
> > Second log: the host controller is ok; the hub is wrong. The HUB can't
> > be recovered by bus reset, only hardware reset HUB can be recovered.
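
The ~150 mV figure in the first guess can be checked with simple divider
arithmetic. This is a minimal sketch, assuming the hub's pull-up to 3.3 V and
the host's 45 ohm termination to ground form a plain resistive divider with
nothing else loading the line; the function name divider_mv is made up for
illustration.

```python
def divider_mv(v_pullup, r_pullup, r_term):
    """Voltage in mV at the node between a pull-up and a termination to ground."""
    return 1000.0 * v_pullup * r_term / (r_pullup + r_term)


# Resistor values taken from the guess above: 1.5 kohm nominal pull-up,
# 900 ohm minimum pull-up, 45 ohm host termination, 3.3 V supply.
nominal_mv = divider_mv(3.3, 1500.0, 45.0)
minimum_mv = divider_mv(3.3, 900.0, 45.0)

if __name__ == "__main__":
    print(f"1.5 kohm pull-up: {nominal_mv:.1f} mV")
    print(f"900 ohm pull-up:  {minimum_mv:.1f} mV")
```

With these numbers the divider gives roughly 96 mV for a 1.5 kohm pull-up and
roughly 157 mV for a 900 ohm pull-up, so the measured 150 mV sits at the
low-resistance end of that range.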
> >
> > Next step:
> > - Try USBPHYx_RXn.ENVADJ (0x20ca020) as 0x3 and 0x1, see if something
> > changes.
> > - When the error occurs, what code will run and what log will show, we
> > could try the recovery method.
> >
> > Peter
> 
> I created a new log today that should be easier to follow. It is on my github here,
> hopefully you can access it.
> https://github.com/cjgriscom/ci-hrdc-logs/tree/master/28Feb2019_1
> 
> I rarely get any USB errors showing up in dmesg after failure, usually just
> 4 lines from smsc95xx.  But I included a full kern.log at the link.
> 
> I tried the ENVADJ changes and they did not seem to have any effect.
> Log 4 shows where I tried modifying it after failure; nothing different happens.
> In Log 7 I set ENVADJ=0x3 and triggered the failure again.  The states are
> attached to log but it looks the same to me.
> 

I can access the log. There are lots of errors on the USB bus; is that done on purpose
to speed up reproducing the issue? If not, the signal quality on the bus may be too poor.

We have had issues reported with two SMSC hubs connected, and in your case there are two
hubs on the bus. Can this be reproduced with one hub, e.g. by removing the hub at usb 1-1.4?

https://community.nxp.com/message/807027

Can you see SoF before DP changes to 3 V? If so, could that be the reason why the controller
can get back to normal? Does the register USBC_n_FRINDEX keep changing during the error?
If it does, the controller is trying to send SoF all the time, and maybe some bus condition
(dp/dm) is blocking it from sending.
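
One way to answer the FRINDEX question is to sample the register a few times
and see whether the value moves. This is a minimal sketch, assuming FRINDEX is
the usual 14-bit EHCI-style frame counter; read_frindex is a hypothetical
callable that you supply (for example a wrapper around whatever memory tool
produced the register dumps above), and the sample count and delay are
arbitrary.

```python
import time

FRINDEX_MASK = 0x3FFF  # assumed: frame index lives in bits 13:0


def frindex_is_advancing(read_frindex, samples=5, delay_s=0.01):
    """Return True if repeated FRINDEX reads yield more than one value.

    read_frindex is a caller-supplied function returning the raw register
    value; this sketch does not access hardware itself.
    """
    seen = set()
    for _ in range(samples):
        seen.add(read_frindex() & FRINDEX_MASK)
        time.sleep(delay_s)
    return len(seen) > 1


if __name__ == "__main__":
    # Demo with fake readers instead of real hardware.
    counter = iter(range(0, 1000, 8))
    print(frindex_is_advancing(lambda: next(counter), delay_s=0))  # moving
    print(frindex_is_advancing(lambda: 0x1234, delay_s=0))         # stuck
```

If the value keeps moving while the bus is stuck, the frame counter is still
running, which fits the idea that the controller is still scheduling SoF but
something on dp/dm prevents it from being sent.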

I am not familiar with the IC's internal FSM, so I will send the log to the IC engineers and
wait for their response. To fix this problem quickly, you could raise this issue with an
NXP FAE, try to improve the signal quality, or replace the hub.

Peter