Re: Bus noise periodically causes ci_hdrc IRQ lockup

Chandler Griscom <cjgriscom@xxxxxxxxx> · Thu, 28 Feb 2019 23:22:44 -0500

On 2/28/19 1:57 AM, Peter Chen wrote:

Let me summarize your observations:
- Binding/unbinding the ci_hdrc device recovers the connection
- Resetting the hub can't recover it; it returns to the previous error
state after the reset

From the registers, we do see something abnormal: the RX is waiting for
the SYNC field. We need to see the DP/DM status to know whether the hub
is misbehaving, e.g. sending data for longer than 20 us (more than 1024 bytes).
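One plausible reading of the 20 us / 1024 bytes figures above is simply the time a 1024-byte payload occupies the bus at high speed. The sketch below works that out. The 480 Mbit/s rate and the 7/6 worst-case bit-stuffing factor come from the USB 2.0 specification; treating them as the origin of the 20 us figure is an assumption, and `hs_duration_us` is a hypothetical helper name.

```python
# Rough bus-time estimate for a high-speed USB payload.
# Assumption: 480 Mbit/s signalling, optional worst-case bit stuffing
# (one stuffed bit after every six data bits, i.e. a factor of 7/6).
# Sync, PID, CRC and EOP overhead are ignored.

HS_BIT_RATE = 480e6  # bits per second


def hs_duration_us(n_bytes, stuffing=1.0):
    """Return the time in microseconds that n_bytes occupy the bus."""
    bits = n_bytes * 8 * stuffing
    return bits / HS_BIT_RATE * 1e6


if __name__ == "__main__":
    print(f"1024 bytes, no stuffing:         {hs_duration_us(1024):.2f} us")
    print(f"1024 bytes, worst-case stuffing: {hs_duration_us(1024, 7 / 6):.2f} us")
```

Under these assumptions a maximum-size 1024-byte payload takes about 17 us unstuffed and just under 20 us with worst-case stuffing, so continuous activity well beyond 20 us would not correspond to a single legal packet.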

I will continue looking into probing Dm/Dp.  Would you like me to do
this *while* the failure occurs, or after?

After the error occurs.

Peter

Hi Peter,

That summary is accurate.

I soldered some leads onto the host/hub connection and hooked them up to an oscilloscope.
The findings were interesting:

Directly after failure:
    0x020CA060 @ 0x2: 0x00000064
    PORTSC reg: 18001a05
    Dm line: 150 mV
    Dp line: 0 V
After failure, hub reset asserted:
    0x020CA060 @ 0x2: 0x00000024
    PORTSC reg: 18001205
    Dm line: 0 V
    Dp line: 0 V
After failure, hub reset released:
    0x020CA060 @ 0x2: 0x00000064
    PORTSC reg: 18001a05
    Dm line: 150 mV
    Dp line: 0 V

It seems strange that it would switch between those voltages -- could the hub and
host be trying to write different values at the same time?

It is impossible for the hub and the host to send data at the same time.

I have noticed something new happening (maybe as a result of hooking up the
probe?).
A couple times now after initial failure, the device has changed states later.
In this state the Linux USB devices appear to 'wake up' and start throwing errors.

     (failure occurs)
[  227.323636] smsc95xx 1-1.4.1:1.0 eth1: Failed to read reg index 0x00000114: -110
[  227.323659] smsc95xx 1-1.4.1:1.0 eth1: Error reading MII_ACCESS
[  227.323677] smsc95xx 1-1.4.1:1.0 eth1: MII is busy in smsc95xx_mdio_read
[  227.323694] smsc95xx 1-1.4.1:1.0 eth1: Failed to read MII_BMSR
     (no errors for 25 minutes, then something changes)
[ 1752.092896] uvcvideo: Non-zero status (-71) in video completion handler.
[ 1752.124744] uvcvideo: Non-zero status (-71) in video completion handler.
[ 1752.124866] usb 1-1.4: clear tt 3 (91c1) error -71
     ...lots of errors...

Registers:
0x020CA060 @ 0x1: 0x0000FFFF
0x020CA060 @ 0x2: 0x00140060
0x020CA060 @ 0x3: 0x10801110
0x020CA060 @ 0x4: 0x00010001
0x020CA060 @ 0x5: 0x01011101
0x020CA060 @ 0x6: 0x00000101
0x020CA060 @ 0x7: 0x06200010 (changing)
0x020CA060 @ 0x8: 0x11000001
    PORTSC reg: steady at 10001801
    Dm line: steady at 3 V
    Dp line: steady at 0 V

And after that, when I reset the hub it returns to normal operation.

    ...lots of errors...
[ 1977.285844] usb 1-1.4: clear tt 3 (91c1) error -71
     (hub reset asserted)
[ 1977.309718] usb 1-1: USB disconnect, device number 2
     (hub reset released)
[ 2088.226453] usb 1-1: new high-speed USB device number 29 using ci_hdrc

Let me know what you think of this,

It seems you recorded DM/DP the opposite way round; please confirm.

I checked and yes, they were opposite.

Besides,
- Do you observe this USB issue on one specific board or on several boards?

I have observed it in two different boards in this specific setup, and 
in two other machines in the field (not in my possession).

- After connecting the probe, sometimes resetting the hub can recover it and
sometimes it can't?

It looks like attaching the probe makes the data lines much more
sensitive to interference.  When I run the welder again after the
initial failure, the state changes as shown in the second log earlier. I
have reproduced this again; see log #5 at the link below.

- When the hub's reset is asserted, do the register dump and measurements look like the following (the can't-recover situation)?
0x020CA060 @ 0x1: 0x00007B2C
0x020CA060 @ 0x2: 0x00000024
0x020CA060 @ 0x3: 0x108401C0 (still changes on every read)
0x020CA060 @ 0x4: 0x00010001
0x020CA060 @ 0x5: 0x01011101
0x020CA060 @ 0x6: 0x00000101
0x020CA060 @ 0x7: 0x05300010
0x020CA060 @ 0x8: 0x81000001
PORTSC reg: 18001205
Dm line: 0 V
Dp line: 0 V
That's correct.
Besides, I need your whole kernel log with and without the probe attached; your original kernel
log on GitHub is OK (but I need to be able to access it). I need to know whether a bus reset or
bus suspend occurs during the whole process.

My guesses are:
First log: the hub enters FS for an unknown reason and enables its 1.5 kohm pull-up (the minimum can be 900 ohm) to 3.3 V.
The host termination is 45 ohm, so the host sees ~150 mV. The host controller is then stuck in the HS rxactive state
forever.
Second log: the host controller is OK; the hub is at fault. The hub can't be recovered by a bus reset; only a hardware
reset of the hub recovers it.
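The ~150 mV estimate in the first guess follows from a plain resistive divider: the hub's pull-up to 3.3 V on top, the host's 45 ohm termination to ground on the bottom. A minimal sketch of that arithmetic, using only the values quoted above (`divider_mv` is a hypothetical helper name):

```python
# Voltage seen on the data line when a full-speed pull-up drives into
# a high-speed host termination. Values are the ones quoted in the thread.

def divider_mv(v_pullup, r_pullup, r_term):
    """Return the divider output in millivolts."""
    return 1000.0 * v_pullup * r_term / (r_pullup + r_term)


if __name__ == "__main__":
    for r_pullup in (1500.0, 900.0):
        mv = divider_mv(3.3, r_pullup, 45.0)
        print(f"pull-up {r_pullup:6.0f} ohm -> {mv:.0f} mV")
```

With the nominal 1.5 kohm pull-up this gives roughly 96 mV, and with the 900 ohm minimum roughly 157 mV, so the measured 150 mV is consistent with a pull-up near the low end of that range.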

Next steps:
- Try setting USBPHYx_RXn.ENVADJ (0x20ca020) to 0x3 and 0x1, and see if anything changes.
- Find out which code runs and what the log shows when the error occurs, so that we can try a
recovery method.

Peter

I created a new log today that should be easier to follow. It is
on my GitHub at the link below; hopefully you can access it.
https://github.com/cjgriscom/ci-hrdc-logs/tree/master/28Feb2019_1

I rarely get any USB errors showing up in dmesg after failure, usually just
4 lines from smsc95xx.  But I included a full kern.log at the link.

I tried the ENVADJ changes and they did not seem to have any effect.
Log 4 shows where I tried modifying it after failure; nothing different happens.
In Log 7 I set ENVADJ=0x3 and triggered the failure again.  The states are attached to
the log, but they look the same to me.

Chandler