RE: Bus noise periodically causes ci_hdrc IRQ lockup

Peter Chen <peter.chen@xxxxxxx> · Thu, 28 Feb 2019 06:57:02 +0000

> > Let me summarize your observations:
> > - Binding/unbinding the ci_hdrc device can recover the connection
> > - Resetting the hub can't recover it; it goes back to the previous
> > error state after reset
> >
> >  From the register, we do see something abnormal: the RX is
> > waiting for the SYNC field. We need to see the dp/dm status to know
> > whether the hub is at fault, e.g. sending data for longer than 20us
> > (more than 1024 bytes)
> >
> >> I will continue looking into probing Dm/Dp.  You would like me to do
> >> this
> >> *while* the failure occurs, or after?
> >>
> > After the error occurs.
> >
> > Peter
> >
> Hi Peter,
> 
> That summary is accurate.
> 
> I soldered some leads onto the host/hub connection and hooked up to oscilloscope.
> The findings were interesting:
> 
> Directly after failure:
>    0x020CA060 @ 0x2: 0x00000064
>    PORTSC reg: 18001a05
>    Dm line: 150 mV
>    Dp line: 0 V
> After failure, hub reset asserted:
>    0x020CA060 @ 0x2: 0x00000024
>    PORTSC reg: 18001205
>    Dm line: 0 V
>    Dp line: 0 V
> After failure, hub reset released:
>    0x020CA060 @ 0x2: 0x00000064
>    PORTSC reg: 18001a05
>    Dm line: 150 mV
>    Dp line: 0 V
> 
> It seems strange that it would switch between those voltages -- could the hub and
> host be trying to write different values at the same time?
> 

It is impossible for the hub and the host to send data at the same time.

> I have noticed something new happening (maybe as a result of hooking up the
> probe?).
> A couple times now after initial failure, the device has changed states later.
> In this state the Linux USB devices appear to 'wake up' and start throwing errors.
> 
>     (failure occurs)
> [  227.323636] smsc95xx 1-1.4.1:1.0 eth1: Failed to read reg index 0x00000114: -110
> [  227.323659] smsc95xx 1-1.4.1:1.0 eth1: Error reading MII_ACCESS
> [  227.323677] smsc95xx 1-1.4.1:1.0 eth1: MII is busy in smsc95xx_mdio_read
> [  227.323694] smsc95xx 1-1.4.1:1.0 eth1: Failed to read MII_BMSR
>     (no errors for 25 minutes, then something changes)
> [ 1752.092896] uvcvideo: Non-zero status (-71) in video completion handler.
> [ 1752.124744] uvcvideo: Non-zero status (-71) in video completion handler.
> [ 1752.124866] usb 1-1.4: clear tt 3 (91c1) error -71
>     ...lots of errors...
> 
> Registers:
> 0x020CA060 @ 0x1: 0x0000FFFF
> 0x020CA060 @ 0x2: 0x00140060
> 0x020CA060 @ 0x3: 0x10801110
> 0x020CA060 @ 0x4: 0x00010001
> 0x020CA060 @ 0x5: 0x01011101
> 0x020CA060 @ 0x6: 0x00000101
> 0x020CA060 @ 0x7: 0x06200010 (changing)
> 0x020CA060 @ 0x8: 0x11000001
>    PORTSC reg: steady at 10001801
>    Dm line: steady at 3 V
>    Dp line: steady at 0 V
> 
> And after that, when I reset the hub it returns to normal operation.
> 
>    ...lots of errors...
> [ 1977.285844] usb 1-1.4: clear tt 3 (91c1) error -71
>     (hub reset asserted)
> [ 1977.309718] usb 1-1: USB disconnect, device number 2
>     (hub reset released)
> [ 2088.226453] usb 1-1: new high-speed USB device number 29 using ci_hdrc
> 
> Let me know what you think of this,
> 

It seems you recorded DM and DP swapped; please confirm.
In addition:
- Do you observe this USB issue on one specific board, or on several boards?
- After connecting the probe, does resetting the hub sometimes recover the
connection and sometimes not?
- When the hub's reset is asserted, do the register dump and the measurements look like the ones below (the situation that can't be recovered)?
0x020CA060 @ 0x1: 0x00007B2C
0x020CA060 @ 0x2: 0x00000024
0x020CA060 @ 0x3: 0x108401C0 (still changes on every read)
0x020CA060 @ 0x4: 0x00010001
0x020CA060 @ 0x5: 0x01011101
0x020CA060 @ 0x6: 0x00000101
0x020CA060 @ 0x7: 0x05300010
0x020CA060 @ 0x8: 0x81000001
PORTSC reg: 18001205
Dm line: 0 V
Dp line: 0 V

I also need your complete kernel log, both with and without the probe attached. Your original
kernel log on GitHub is fine, but you need to give me access to it. I need to know whether a
bus reset or a bus suspend occurs during the whole process.

My guesses are:
First log: the hub enters FS for an unknown reason and enables its 1.5 kohm pull-up (which can
be as low as 900 ohm) to 3.3 V. The host side is 45 ohm, so the host sees ~150 mV on the line.
In this situation the host controller stays stuck in the HS rxactive state forever.
Second log: the host controller is OK; the hub is at fault. A bus reset can't recover the hub;
only a hardware reset of the hub can.
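
As a rough check of the first guess, the line voltage can be estimated as a
simple resistor divider between the hub's pull-up and the host's 45 ohm
termination. This is only a sketch that ignores everything else on the line;
the function name is made up for illustration.

```python
# Rough estimate: pull-up to 3.3 V on the hub side against the host's
# 45 ohm termination to ground, treated as an ideal resistor divider.

def line_voltage(v_pullup, r_pullup, r_term):
    """Voltage across r_term in a two-resistor divider."""
    return v_pullup * r_term / (r_pullup + r_term)


if __name__ == "__main__":
    for r_pullup in (1500.0, 900.0):
        mv = line_voltage(3.3, r_pullup, 45.0) * 1000.0
        print(f"pull-up {r_pullup:.0f} ohm -> {mv:.0f} mV")
```

This gives about 96 mV for a nominal 1.5 kohm pull-up and about 157 mV for
900 ohm, so the measured 150 mV fits a pull-up near the low end of its range.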

Next steps:
- Try setting USBPHYx_RXn.ENVADJ (0x20ca020) to 0x3 and to 0x1, and see whether anything changes.
- Find out which code runs and which log messages appear when the error occurs; we could then try
a recovery method at that point.
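
For the ENVADJ experiment, the value to write can be computed with a
read-modify-write so that the other fields in the register stay untouched.
The sketch below assumes ENVADJ occupies bits [2:0] of the RX register;
please confirm that in the reference manual. The helper name and the
placeholder starting value are only for illustration.

```python
# Sketch of a read-modify-write for the ENVADJ field.
# Assumption: ENVADJ occupies bits [2:0] of the USBPHY RX register.

ENVADJ_SHIFT = 0
ENVADJ_MASK = 0x7 << ENVADJ_SHIFT


def with_envadj(rx_value, envadj):
    """Return rx_value with only the ENVADJ field replaced."""
    if not 0 <= envadj <= 0x7:
        raise ValueError("ENVADJ is a 3-bit field")
    cleared = rx_value & ~ENVADJ_MASK & 0xFFFFFFFF
    return cleared | (envadj << ENVADJ_SHIFT)


if __name__ == "__main__":
    current = 0x00000000  # placeholder; read the real value from 0x20ca020 first
    for setting in (0x3, 0x1):
        print(f"ENVADJ={setting:#x}: write {with_envadj(current, setting):#010x}")
```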

Peter