Re: NEC uPD720200 xHCI Controller dies when Runtime PM enabled

Mathias Nyman <mathias.nyman@xxxxxxxxx> · Thu, 18 Feb 2016 17:12:35 +0200




On 16.02.2016 23:58, main.haarp@xxxxxxxxxxxxxx wrote:


On 2016-02-08 15:31, Mathias Nyman wrote:
Hi

On 06.02.2016 19:08, Mike Murdoch wrote:
Bug ID: 111251

Hello,

I have a NEC uPD720200 USB3.0 controller in a Thinkpad W520 laptop on
kernel 4.4.1-gentoo.

0e:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 04) (prog-if 30 [XHCI])
      Subsystem: Lenovo uPD720200 USB 3.0 Host Controller

When runtime power control for this controller is disabled
(/sys/bus/pci/devices/0000:0e:00.0/power/control = on), the controller
works fine and reaches over 120MB/s transfer rates.

When runtime power control for this controller is enabled
(/sys/bus/pci/devices/0000:0e:00.0/power/control = auto), two effects
can be observed:

- Transfer rates are much lower at around 30MB/s
- During transfers, the controller dies after a couple of seconds:

xhci_hcd 0000:0e:00.0: xHCI host not responding to stop endpoint command.
xhci_hcd 0000:0e:00.0: Assuming host is dying, halting host.
xhci_hcd 0000:0e:00.0: Host not halted after 16000 microseconds.
xhci_hcd 0000:0e:00.0: Non-responsive xHCI host is not halting.
xhci_hcd 0000:0e:00.0: Completing active URBs anyway.
xhci_hcd 0000:0e:00.0: HC died; cleaning up
sd 9:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
sd 9:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 00 19 a9 00 00 00 f0 00
blk_update_request: I/O error, dev sdc, sector 1681664
xhci_hcd 0000:0e:00.0: Stopped the command ring failed, maybe the host is dead
xhci_hcd 0000:0e:00.0: Host not halted after 16000 microseconds.
xhci_hcd 0000:0e:00.0: Abort command ring failed
xhci_hcd 0000:0e:00.0: HC died; cleaning up

At this point, a reboot is required to reactivate the controller;
unloading and reloading the xhci_* modules does not work.
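
For anyone trying to reproduce this, the two settings described above can
be checked from a script. This is only a minimal sketch, assuming the usual
sysfs layout (power/control and power/runtime_status under the PCI device
directory). The function name and the sysfs_root parameter are made up for
illustration; the PCI address is the one from this report.

```python
from pathlib import Path


def runtime_pm_state(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the runtime PM attributes of a PCI device as a dict.

    sysfs_root is a hypothetical parameter so the function can be
    pointed at a test directory. Attributes that cannot be read are
    reported as None instead of raising.
    """
    power_dir = Path(sysfs_root) / pci_addr / "power"
    state = {}
    for attr in ("control", "runtime_status"):
        try:
            state[attr] = (power_dir / attr).read_text().strip()
        except OSError:
            state[attr] = None
    return state
```

Calling runtime_pm_state("0000:0e:00.0") on the affected machine should
report control as "on" or "auto", matching the two cases above.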


With 120MB/s I assume it was a USB3 device.
Was there any USB 2 device connected as well?
Does this occur with only a USB2 device connected to xhci?

xhci handles suspend/resume a bit differently for USB2 and USB3 roothubs.

Does this happen on older kernels as well? 4.3 or 4.2 based?

For more xhci debugging, do:
echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
and check dmesg for more xhci info.
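
If you want to toggle this from a script instead, something like the
following should do. It is a rough sketch, assuming debugfs is mounted at
/sys/kernel/debug and the script runs as root. The function name and the
control parameter are hypothetical; "-p" is the dynamic debug syntax for
clearing the print flag again.

```python
def set_xhci_debug(enable, control="/sys/kernel/debug/dynamic_debug/control"):
    """Enable or disable debug messages of the xhci_hcd module.

    Writes the same query as the echo command above ("=p") to enable,
    and "-p" to disable. Returns the query string that was written.
    """
    query = "module xhci_hcd " + ("=p" if enable else "-p")
    with open(control, "w") as f:
        f.write(query)
    return query
```

Call set_xhci_debug(True) before reproducing the problem and
set_xhci_debug(False) afterwards to keep dmesg quiet again.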

If reloading the module did not help, it is more likely that the
controller is in some unexpected state.
If, however, this is just bad timeout timer handling, we could return
immediately in the timeout handler and check whether the USB device(s)
continue to work normally.

This could be done by editing drivers/usb/host/xhci-ring.c:

+++ b/drivers/usb/host/xhci-ring.c
@@ -831,6 +831,7 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
         struct xhci_virt_ep *ep;
         int ret, i, j;
         unsigned long flags;
+       return;

-Mathias


Hello Mat,

thanks for your response. I have experimented with your suggestions.

As for your questions: No, there was only one USB3 stick connected to
the host controller during the tests. USB2 devices work fine too.

Yes, I encountered this problem on a 4.1 series kernel as well as the 4.4
series.

I have enabled the debug controls and attached the results to this mail,
along with some commentary. I am hoping this works in the mailing list.

I've also tried your suggested modification, and it does seem to work!
With it, the controller does not die, but it still sacrifices a lot of
speed (as I mentioned in the first mail of this thread).


I hope this is helpful!


Thanks, it is helpful.

It looks like when the USB3 device is inserted, it is first detected as a USB2 device and
then immediately afterwards as a USB3 device. The USB2 device stops responding, so 5
seconds later we time out and kill everything.

selected parts of the log:

inserting usb3 storage device
20:03:33 xhci_hcd 0000:0e:00.0: xhci_resume: starting port polling.
20:03:33 xhci_hcd 0000:0e:00.0: Port Status Change Event for port 3
20:03:33 xhci_hcd 0000:0e:00.0: get port status, actual port 0 status  = 0x202e1  /* PORT 0
20:03:33 xhci_hcd 0000:0e:00.0: get port status, actual port 1 status  = 0x2a0     /* PORT 1
20:03:33 usb 1-1: new high-speed USB device number 2 using xhci_hcd
20:03:33 xhci_hcd 0000:0e:00.0: Slot ID 1 Input Context:			/* Found a HS device
20:03:33 xhci_hcd 0000:0e:00.0: IN Endpoint 00 Context (ep_index 00):
20:03:33 xhci_hcd 0000:0e:00.0: @ffff8805fc8a5048 (virt) @ffffa048 (dma) 0xfffdf001 - deq
20:03:33 xhci_hcd 0000:0e:00.0: Successful setup context command
 *   now we have a device at SLOT 1 with control endpoint 0 buffer at address  0xfffdf000
20:03:33 xhci_hcd 0000:0e:00.0: Slot ID 2 Input Context:
20:03:33 xhci_hcd 0000:0e:00.0: IN Endpoint 00 Context (ep_index 00):
20:03:33 xhci_hcd 0000:0e:00.0: @ffff8800b68d7048 (virt) @ffff2048 (dma) 0xfffe1001 - deq
 * now we have another device at SLOT 2 with control endpoint buffer at 0xfffe1000
20:03:33 usb 2-1: new SuperSpeed USB device number 3 using xhci_hcd     /* found SS device
20:03:33 usb 2-1: New USB device found, idVendor=0951, idProduct=1666
20:03:33 usb 2-1: Product: DataTraveler 3.0
20:03:33 usb 2-1: Manufacturer: Kingston
20:03:33 usb 2-1: SerialNumber: AC220B280C8FBFA1F96CA020
20:03:33 usb-storage 2-1:1.0: USB Mass Storage device detected
20:03:34 sd 7:0:0:0: [sdc] Attached SCSI removable disk
20:03:38 xhci_hcd 0000:0e:00.0: Cancel URB ffff8805fc825600, dev 1, ep 0x0, starting at offset 0xfffdf000
  * The URB placed on the HS device's control endpoint ring is canceled after a 5 second timeout.

We try to remove the cancelled URB from the control endpoint ring of the USB2 HS device,
but we fail to stop the ring (probably because there is no real USB2 device running anymore),
and then kill everything.
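
The 5 second gap can be read straight off the timestamps. A small sketch of
that arithmetic, using three lines copied from the log excerpt above; the
helper name is made up, and it assumes HH:MM:SS prefixes that do not wrap
past midnight:

```python
from datetime import datetime

# Lines copied from the log excerpt in this mail.
LOG = """\
20:03:33 usb 1-1: new high-speed USB device number 2 using xhci_hcd
20:03:33 usb 2-1: new SuperSpeed USB device number 3 using xhci_hcd
20:03:38 xhci_hcd 0000:0e:00.0: Cancel URB ffff8805fc825600, dev 1, ep 0x0, starting at offset 0xfffdf000
"""


def seconds_between(log, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the
    first line containing end_marker (hypothetical helper)."""
    def first_time(marker):
        for line in log.splitlines():
            if marker in line:
                return datetime.strptime(line.split()[0], "%H:%M:%S")
        raise ValueError(f"marker not found: {marker!r}")

    return (first_time(end_marker) - first_time(start_marker)).total_seconds()


# Gap between the HS device showing up and its control URB being cancelled.
print(seconds_between(LOG, "new high-speed USB device", "Cancel URB"))  # 5.0
```

The same helper shows that the SuperSpeed device appears within the same
second as the high-speed one.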

I need to look at this in more detail: check whether the speed changes on port reset, whether
there is some race in the resume code or somewhere else, or whether we don't give link training
enough time on the USB3 side before starting USB2 device initialization.
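
To help read the two "get port status" values in the log, here is a rough
decoder sketch. It assumes the driver prints the raw PORTSC register and
uses the bit layout as I understand it from the xHCI specification (CCS
bit 0, PED bit 1, PR bit 4, PLS bits 5-8, PP bit 9, speed bits 10-13, CSC
bit 17), so check it against the spec before relying on it. The function
and table names are made up.

```python
# Port link state names for the PLS field (assumed from the xHCI spec).
PLS_NAMES = {
    0: "U0", 1: "U1", 2: "U2", 3: "U3", 4: "Disabled", 5: "RxDetect",
    6: "Inactive", 7: "Polling", 8: "Recovery", 9: "Hot Reset",
    10: "Compliance", 11: "Test Mode", 15: "Resume",
}


def decode_portsc(value):
    """Split a raw PORTSC value into a few of its fields."""
    pls = (value >> 5) & 0xF
    return {
        "connected": bool(value & (1 << 0)),        # CCS
        "enabled": bool(value & (1 << 1)),          # PED
        "in_reset": bool(value & (1 << 4)),         # PR
        "link_state": PLS_NAMES.get(pls, f"reserved({pls})"),
        "powered": bool(value & (1 << 9)),          # PP
        "speed_id": (value >> 10) & 0xF,            # port speed ID
        "connect_change": bool(value & (1 << 17)),  # CSC
    }


# The two values from the log excerpt above.
for raw in (0x202E1, 0x2A0):
    print(hex(raw), decode_portsc(raw))
```

Under those assumptions, 0x202e1 decodes as connected, powered, link state
Polling, with a connect status change pending, while 0x2a0 decodes as
powered, nothing connected, link state RxDetect.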

What does the log look like when attaching the USB3 storage device with runtime power disabled?
I'd guess there is no HS device detected at all.

-Mathias