Re: [BUG] "HC died; cleaning up" and have to reboot

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi

Thank to try to help me.
Responses below.


> From: "Mathias Nyman" <mathias.nyman@xxxxxxxxxxxxxxx>
> 
> Hi
> 
> On 24.02.2017 17:01, shal@xxxxxxx wrote:
> > Hello,
> >
> > I have a BUG on USB xhci.
> > The trace here :
> > [11518.982950] xhci_hcd 0000:07:00.0: Stopped the command ring
> > failed, maybe the host is dead
> > [11519.027106] xhci_hcd 0000:07:00.0: Host halt failed, -110
> > [11519.027108] xhci_hcd 0000:07:00.0: Abort command ring failed
> > [11519.027215] xhci_hcd 0000:07:00.0: HC died; cleaning up
> > [11519.027230] xhci_hcd 0000:07:00.0: Timeout while waiting for
> > setup device command
> > [11519.442303] usb 3-1: device not accepting address 15, error -108
> > [11519.442324] usb usb3-port1: couldn't allocate usb_device
> >
> > After this error happens, I have to reboot Linux. Without reboot
> > the USB port doesn't work for any devices.
> 
> We're waiting for the device to respond to a setup  device. It
> doesn't respond, so we have to cancel the command.
> (stop the command ring, skip the command, and restart the command
> ring)
> 
> We first fail in stopping the command ring, then we fail in halting
> the entire host controller.


OK, I have the same conclusions

> 
> >
> >
> > The situation.
> > uname -a :
> > Linux shal 4.10.0-8-generic #10-Ubuntu SMP Mon Feb 13 14:04:59 UTC
> > 2017 x86_64 x86_64 x86_64 GNU/Linux
> 
> 4.10 contains changes in exactly this area to prevent a race that
> might re-start the command we check if it stopped
> 
> Do you have an older kernel available to check if its a regression in
> 4.10?

I have tested also with a 4.9.0-15-generic #16-Ubuntu SMP. 
Messages are not exactly the same but globally it is the same error

With the 4.9 (and active xhci debug)
[   55.015537] xhci_hcd 0000:07:00.0: Command timeout
[   55.015540] xhci_hcd 0000:07:00.0: Abort command ring
[   76.729790] xhci_hcd 0000:07:00.0: Stopped the command ring failed, maybe the host is dead
[   76.729798] xhci_hcd 0000:07:00.0: // Halt the HC
[   76.772864] xhci_hcd 0000:07:00.0: Host not halted after 16000 microseconds.
[   76.772865] xhci_hcd 0000:07:00.0: Abort command ring failed
[   76.772917] xhci_hcd 0000:07:00.0: HC died; cleaning up
[   76.772926] xhci_hcd 0000:07:00.0: xHCI host controller is dead.
[   76.772957] xhci_hcd 0000:07:00.0: Timeout while waiting for setup device command
[   77.188037] usb 3-1: device not accepting address 2, error -108
[   77.188049] xhci_hcd 0000:07:00.0: Endpoint 0x0 ep reset callback called
[   77.188063] xhci_hcd 0000:07:00.0: xHCI dying or halted, can't queue_command
[   77.188064] xhci_hcd 0000:07:00.0: FIXME: allocate a command ring segment
[   77.188066] usb usb3-port1: couldn't allocate usb_device

The FIXME can perhaps help you.
See also at the end of email for full trace on 4.10 kernel


> 
> Do you have a host from another vendor to try this on?
> Log show that host controller becomes really unresponsive after we
> try to abort the command ring.

I have no other PC but if needed I can retrieve one (or a arm board).
But see below the ehci test.


> >
> > I work with an old Android smartphone in fastboot mode. The
> > smartphone is connected with a long USB cable (5m).
> > In fastboot mode (and only with this mode), the devices is not
> > reachable .
> > There is error like this :
> > usb 3-1: device not accepting address 12, error -71
> >
> > So, I had "usbcore.old_scheme_first=1" in kernel command option and
> > then I can reach the device in fastboot mode.
> >
> > But I performs some operation on the smartphone and sometime the
> > device hung .
> 
> Does the host always hang after a command times out?, i.e is there
> ever a timeout message:
> "xhci_hcd 0000:07:00.0: Timeout while ..." without the host dying
> messages:
> 
> xhci_hcd 0000:07:00.0: Stopped the command ring failed, maybe the
> host is dead
> xhci_hcd 0000:07:00.0: Host halt failed, -110
> xhci_hcd 0000:07:00.0: Abort command ring failed
> xhci_hcd 0000:07:00.0: HC died; cleaning up


Yes, the error always happens  Sometime traces are not the same but always :
- xhci_handle_command_timeout
- HC died; cleaning up
- usb usb3-port1: couldn't allocate usb_device



> You could try if the EHCI usb controller works.
> 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series
> Chipset Family USB Enhanced Host Controller

On this EHCI USB controler : 
[   13.900683] usb 2-1.4: new high-speed USB device number 5 using ehci-pci
[   19.332599] usb 2-1.4: device not accepting address 5, error -71
[   19.412611] usb 2-1.4: new high-speed USB device number 6 using ehci-pci
[   24.960558] usb 2-1.4: device not accepting address 6, error -71
[   25.040533] usb 2-1.4: new high-speed USB device number 7 using ehci-pci
[   30.252452] usb 2-1.4: device descriptor read/64, error -110
[   35.636399] usb 2-1.4: device descriptor read/64, error -71
[   35.819488] usb 2-1.4: new high-speed USB device number 8 using ehci-pci
[   40.845551] usb 2-1.4: device descriptor read/64, error -110
[   46.141461] usb 2-1.4: device descriptor read/64, error -71
[   46.248291] usb 2-1-port4: unable to enumerate USB device


I recall that the device (smartphone) is perhaps not worked well (I try to develop some low level function in fastboot mode).
Some having some error is perhaps normal and in the ECHI case the error seems to be correctly managed.
I can perform usb packet capture with wireshark if needed.


> >
> > More traces:
> > [11466.611552] usb 3-1: USB disconnect, device number 11
> 
> sudden disconnect
> 
> > [11468.957608] usb 3-1: new high-speed USB device number 12 using
> > xhci_hcd
> > [11470.878811] usb 3-1: Device not responding to setup address.
> > [11486.881738] usb 3-1: Device not responding to setup address.
> 
> So there are already a couple transaction errors when trying to
> address the device
> 
> > [11487.088447] usb 3-1: device not accepting address 12, error -71
> > [11487.532378] usb 3-1: new high-speed USB device number 14 using
> > xhci_hcd
> > [11487.559735] usb 3-1: unable to get BOS descriptor
> > [11487.564929] usb 3-1: New USB device found, idVendor=18d1,
> > idProduct=d00d
> > [11487.564932] usb 3-1: New USB device strings: Mfr=1, Product=2,
> > SerialNumber=0
> > [11487.564934] usb 3-1: Product: Android
> > [11487.564935] usb 3-1: Manufacturer: Google
> > [11489.585534] usb 3-1: USB disconnect, device number 14
> 
> sudden disconnect
> 
> > [11491.748090] usb 3-1: new high-speed USB device number 15 using
> > xhci_hcd
> > [11518.982950] xhci_hcd 0000:07:00.0: Stopped the command ring
> > failed, maybe the host is dead
> > [11519.027106] xhci_hcd 0000:07:00.0: Host halt failed, -110
> > [11519.027108] xhci_hcd 0000:07:00.0: Abort command ring failed
> > [11519.027215] xhci_hcd 0000:07:00.0: HC died; cleaning up
> > [11519.027230] xhci_hcd 0000:07:00.0: Timeout while waiting for
> > setup device command
> > [11519.442303] usb 3-1: device not accepting address 15, error -108
> > [11519.442324] usb usb3-port1: couldn't allocate usb_device
> 
> Connection looks really unreliable.
> 
> Enabling xhci debugging might reveal something:
> echo -n 'module xhci_hcd =p' >
> /sys/kernel/debug/dynamic_debug/control
> 
> -Mathias


Additional trace with this xhci debugging enable on the 4.10.0 kernel :


[  325.760164] xhci_hcd 0000:07:00.0: Port Status Change Event for port 3
[  325.760169] xhci_hcd 0000:07:00.0: resume root hub
[  325.760178] xhci_hcd 0000:07:00.0: handle_port_status: starting port polling.
[  325.760233] xhci_hcd 0000:07:00.0: get port status, actual port 0 status  = 0x202e1
[  325.760253] xhci_hcd 0000:07:00.0: Get port status returned 0x10101
[  325.760279] xhci_hcd 0000:07:00.0: clear port connect change, actual port 0 status  = 0x2e1
[  325.760291] xhci_hcd 0000:07:00.0: get port status, actual port 1 status  = 0x2a0
[  325.760294] xhci_hcd 0000:07:00.0: Get port status returned 0x100
[  325.867287] xhci_hcd 0000:07:00.0: get port status, actual port 0 status  = 0x2e1
[  325.867290] xhci_hcd 0000:07:00.0: Get port status returned 0x101
[  325.867309] xhci_hcd 0000:07:00.0: // Ding dong!
[  325.867456] xhci_hcd 0000:07:00.0: Slot 1 output ctx = 0x35040000 (dma)
[  325.867460] xhci_hcd 0000:07:00.0: Slot 1 input ctx = 0x35041000 (dma)
[  325.867476] xhci_hcd 0000:07:00.0: Set slot id 1 dcbaa entry ffff9cbb35006008 to 0x35040000
[  325.867491] xhci_hcd 0000:07:00.0: set port reset, actual port 0 status  = 0x2e1
[  325.871230] xhci_hcd 0000:07:00.0: xhci_hub_status_data: stopping port polling.
[  325.927279] xhci_hcd 0000:07:00.0: get port status, actual port 0 status  = 0x331
[  325.927282] xhci_hcd 0000:07:00.0: Get port status returned 0x111
[  325.935523] xhci_hcd 0000:07:00.0: Port Status Change Event for port 3
[  325.935530] xhci_hcd 0000:07:00.0: handle_port_status: starting port polling.
[  325.987261] xhci_hcd 0000:07:00.0: get port status, actual port 0 status  = 0x200e03
[  325.987264] xhci_hcd 0000:07:00.0: Get port status returned 0x100503
[  325.987284] xhci_hcd 0000:07:00.0: clear port reset change, actual port 0 status  = 0xe03
[  326.047245] usb 3-1: new high-speed USB device number 2 using xhci_hcd
[  326.047253] xhci_hcd 0000:07:00.0: Set root hub portnum to 3
[  326.047254] xhci_hcd 0000:07:00.0: Set fake root hub portnum to 1
[  326.047256] xhci_hcd 0000:07:00.0: udev->tt =           (null)
[  326.047258] xhci_hcd 0000:07:00.0: udev->ttport = 0x0
[  326.047260] xhci_hcd 0000:07:00.0: Slot ID 1 Input Context:
[  326.047263] xhci_hcd 0000:07:00.0: @ffff9cbb35041000 (virt) @35041000 (dma) 0x000000 - drop flags
[  326.047265] xhci_hcd 0000:07:00.0: @ffff9cbb35041004 (virt) @35041004 (dma) 0x000003 - add flags
[  326.047267] xhci_hcd 0000:07:00.0: @ffff9cbb35041008 (virt) @35041008 (dma) 0x000000 - rsvd2[0]
[  326.047269] xhci_hcd 0000:07:00.0: @ffff9cbb3504100c (virt) @3504100c (dma) 0x000000 - rsvd2[1]
[  326.047271] xhci_hcd 0000:07:00.0: @ffff9cbb35041010 (virt) @35041010 (dma) 0x000000 - rsvd2[2]
[  326.047273] xhci_hcd 0000:07:00.0: @ffff9cbb35041014 (virt) @35041014 (dma) 0x000000 - rsvd2[3]
[  326.047274] xhci_hcd 0000:07:00.0: @ffff9cbb35041018 (virt) @35041018 (dma) 0x000000 - rsvd2[4]
[  326.047276] xhci_hcd 0000:07:00.0: @ffff9cbb3504101c (virt) @3504101c (dma) 0x000000 - rsvd2[5]
[  326.047278] xhci_hcd 0000:07:00.0: Slot Context:
[  326.047280] xhci_hcd 0000:07:00.0: @ffff9cbb35041020 (virt) @35041020 (dma) 0x8300000 - dev_info
[  326.047282] xhci_hcd 0000:07:00.0: @ffff9cbb35041024 (virt) @35041024 (dma) 0x030000 - dev_info2
[  326.047284] xhci_hcd 0000:07:00.0: @ffff9cbb35041028 (virt) @35041028 (dma) 0x000000 - tt_info
[  326.047286] xhci_hcd 0000:07:00.0: @ffff9cbb3504102c (virt) @3504102c (dma) 0x000000 - dev_state
[  326.047288] xhci_hcd 0000:07:00.0: @ffff9cbb35041030 (virt) @35041030 (dma) 0x000000 - rsvd[0]
[  326.047290] xhci_hcd 0000:07:00.0: @ffff9cbb35041034 (virt) @35041034 (dma) 0x000000 - rsvd[1]
[  326.047291] xhci_hcd 0000:07:00.0: @ffff9cbb35041038 (virt) @35041038 (dma) 0x000000 - rsvd[2]
[  326.047293] xhci_hcd 0000:07:00.0: @ffff9cbb3504103c (virt) @3504103c (dma) 0x000000 - rsvd[3]
[  326.047295] xhci_hcd 0000:07:00.0: IN Endpoint 00 Context (ep_index 00):
[  326.047297] xhci_hcd 0000:07:00.0: @ffff9cbb35041040 (virt) @35041040 (dma) 0x000000 - ep_info
[  326.047299] xhci_hcd 0000:07:00.0: @ffff9cbb35041044 (virt) @35041044 (dma) 0x400026 - ep_info2
[  326.047301] xhci_hcd 0000:07:00.0: @ffff9cbb35041048 (virt) @35041048 (dma) 0x35042001 - deq
[  326.047303] xhci_hcd 0000:07:00.0: @ffff9cbb35041050 (virt) @35041050 (dma) 0x000000 - tx_info
[  326.047305] xhci_hcd 0000:07:00.0: @ffff9cbb35041054 (virt) @35041054 (dma) 0x000000 - rsvd[0]
[  326.047306] xhci_hcd 0000:07:00.0: @ffff9cbb35041058 (virt) @35041058 (dma) 0x000000 - rsvd[1]
[  326.047308] xhci_hcd 0000:07:00.0: @ffff9cbb3504105c (virt) @3504105c (dma) 0x000000 - rsvd[2]
[  326.047310] xhci_hcd 0000:07:00.0: OUT Endpoint 01 Context (ep_index 01):
[  326.047312] xhci_hcd 0000:07:00.0: @ffff9cbb35041060 (virt) @35041060 (dma) 0x000000 - ep_info
[  326.047313] xhci_hcd 0000:07:00.0: @ffff9cbb35041064 (virt) @35041064 (dma) 0x000000 - ep_info2
[  326.047315] xhci_hcd 0000:07:00.0: @ffff9cbb35041068 (virt) @35041068 (dma) 0x000000 - deq
[  326.047317] xhci_hcd 0000:07:00.0: @ffff9cbb35041070 (virt) @35041070 (dma) 0x000000 - tx_info
[  326.047319] xhci_hcd 0000:07:00.0: @ffff9cbb35041074 (virt) @35041074 (dma) 0x000000 - rsvd[0]
[  326.047320] xhci_hcd 0000:07:00.0: @ffff9cbb35041078 (virt) @35041078 (dma) 0x000000 - rsvd[1]
[  326.047322] xhci_hcd 0000:07:00.0: @ffff9cbb3504107c (virt) @3504107c (dma) 0x000000 - rsvd[2]
[  326.047324] xhci_hcd 0000:07:00.0: IN Endpoint 01 Context (ep_index 02):
[  326.047325] xhci_hcd 0000:07:00.0: @ffff9cbb35041080 (virt) @35041080 (dma) 0x000000 - ep_info
[  326.047327] xhci_hcd 0000:07:00.0: @ffff9cbb35041084 (virt) @35041084 (dma) 0x000000 - ep_info2
[  326.047329] xhci_hcd 0000:07:00.0: @ffff9cbb35041088 (virt) @35041088 (dma) 0x000000 - deq
[  326.047331] xhci_hcd 0000:07:00.0: @ffff9cbb35041090 (virt) @35041090 (dma) 0x000000 - tx_info
[  326.047332] xhci_hcd 0000:07:00.0: @ffff9cbb35041094 (virt) @35041094 (dma) 0x000000 - rsvd[0]
[  326.047334] xhci_hcd 0000:07:00.0: @ffff9cbb35041098 (virt) @35041098 (dma) 0x000000 - rsvd[1]
[  326.047336] xhci_hcd 0000:07:00.0: @ffff9cbb3504109c (virt) @3504109c (dma) 0x000000 - rsvd[2]
[  326.047341] xhci_hcd 0000:07:00.0: // Ding dong!
[  326.163180] xhci_hcd 0000:07:00.0: xhci_hub_status_data: stopping port polling.
[  331.127027] xhci_hcd 0000:07:00.0: Command timeout
[  331.127030] xhci_hcd 0000:07:00.0: Abort command ring
[  352.801379] xhci_hcd 0000:07:00.0: Stopped the command ring failed, maybe the host is dead
[  352.801384] xhci_hcd 0000:07:00.0: // Halt the HC
[  352.844705] xhci_hcd 0000:07:00.0: Host halt failed, -110
[  352.844708] xhci_hcd 0000:07:00.0: Abort command ring failed
[  352.844720] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[  352.844721] Modules linked in: pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) binfmt_misc eeepc_wmi asus_wmi sparse_keymap snd_ua101 snd_usb_audio snd_usbmidi_lib snd_seq_midi snd_hwdep snd_seq_midi_event snd_rawmidi snd_pcm snd_seq snd_seq_device snd_timer joydev input_leds snd soundcore intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cryptd intel_cstate intel_rapl_perf serio_raw lpc_ich mei_me mei shpchp tpm_infineon mac_hid nct6775 hwmon_vid coretemp parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid i915 i2c_algo_bit drm_kms_helper syscopyarea psmouse sysfillrect sysimgblt fb_sys_fops drm ahci r8169 libahci mii wmi fjes video
[  352.844752] CPU: 0 PID: 1988 Comm: kworker/0:3 Tainted: G           OE   4.10.0-8-generic #10-Ubuntu
[  352.844753] Hardware name: System manufacturer System Product Name/P8Z68-M PRO, BIOS 4003 05/09/2013
[  352.844753] Workqueue: events xhci_handle_command_timeout
[  352.844755] task: ffff9cbd13558000 task.stack: ffffbd0102ea0000
[  352.844756] RIP: 0010:delay_tsc+0x2f/0x60
[  352.844756] RSP: 0018:ffffbd0102ea3dc8 EFLAGS: 00000006
[  352.844758] RAX: 000000f2fa2c9f6c RBX: ffffbd01012a8038 RCX: 000000f2fa2c9f6c
[  352.844759] RDX: 000000f200000000 RSI: 000000f2fa2c9a50 RDI: 0000000000000a1c
[  352.844759] RBP: ffffbd0102ea3dc8 R08: 0000000000000000 R09: 00000000000003af
[  352.844760] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000000
[  352.844761] R13: 0000000000000008 R14: 0000000000276d23 R15: 000000000000000c
[  352.844762] FS:  0000000000000000(0000) GS:ffff9cbd1f200000(0000) knlGS:0000000000000000
[  352.844763] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  352.844763] CR2: 00000350a55b2008 CR3: 000000005f409000 CR4: 00000000000406f0
[  352.844764] Call Trace:
[  352.844764]  __const_udelay+0x2d/0x30
[  352.844765]  xhci_handshake+0x34/0x70
[  352.844766]  xhci_handle_command_timeout+0x25f/0x300
[  352.844767]  process_one_work+0x1fc/0x4b0
[  352.844767]  worker_thread+0x4b/0x500
[  352.844768]  kthread+0x101/0x140
[  352.844769]  ? process_one_work+0x4b0/0x4b0
[  352.844769]  ? kthread_create_on_node+0x60/0x60
[  352.844770]  ret_from_fork+0x2c/0x40
[  352.844770] Code: 90 55 48 89 e5 65 44 8b 05 b7 12 db 51 0f ae e8 0f 31 48 c1 e2 20 48 89 d6 48 09 c6 0f ae e8 0f 31 48 c1 e2 20 48 09 d0 48 89 c1 <48> 29 f1 48 39 cf 76 19 f3 90 65 8b 0d 88 12 db 51 41 39 c8 74 
[  352.844897] xhci_hcd 0000:07:00.0: HC died; cleaning up
[  352.844903] xhci_hcd 0000:07:00.0: xHCI host controller is dead.
[  352.844914] xhci_hcd 0000:07:00.0: Timeout while waiting for setup device command
[  353.257927] usb 3-1: device not accepting address 2, error -108
[  353.257939] xhci_hcd 0000:07:00.0: Endpoint 0x0 ep reset callback called
[  353.257954] xhci_hcd 0000:07:00.0: xHCI dying or halted, can't queue_command
[  353.257955] xhci_hcd 0000:07:00.0: FIXME: allocate a command ring segment
[  353.257957] usb usb3-port1: couldn't allocate usb_device


Thank!,
Olivier
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux