Re: Help with xHCI, uvcvideo driver, and "Unknown event condition, HC probably busted"

Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx> · Mon, 8 Apr 2013 10:18:28 -0700

Hi Ryan,

I'm the xHCI driver author.  It helps if you Cc maintainers (listed in
the MAINTAINERS file), otherwise emails do get lost:

http://lxr.linux.no/#linux/MAINTAINERS

Comments below.

On Tue, Mar 26, 2013 at 02:55:48PM -0700, Ryan Press wrote:
> I'm working to get the 3.9-rc1 kernel working on the Globalscale
> Mirabox http://www.globalscaletechnologies.com/p-58-mirabox-development-kit.aspx.
>  The PCIe driver by Thomas Petazzoni is not in mainline yet but I have
> his latest patch; this is required for the FL1009 host controller.
> 
> I have everything mostly working.  A USB 3.0 SSD works great, I tested
> un-buffered speed at 120 MB/s.  I am trying to get a USB 2.0 video
> capture device working and I'm running into problems.  So far I have
> tested using the uvcvideo and stk1160.  They both fail with "xhci_hcd
> 0000:02:00.0: ERROR Unknown event condition, HC probably busted",
> although the uvcvideo device does work for perhaps a minute with
> smooth video.  Sometimes it resets the host controller and the
> attached SSD goes offline.
> 
> On a different ARM box running 3.8-rc1 kernel the stk1160 is known
> working, albeit it has only a USB 2.0 host controller.  So I don't
> suspect this driver necessarily.

I don't suppose you have a different USB 3.0 host controller to attach
the camera to?

> Below is my dmesg log.  I would think the PCIe driver could be the
> problem but because the USB 3.0 SSD works perfectly I'm not sure this
> is the case.  Does anyone have insight into what's going on?

The dmesg looks worrisome.  It looks like something is wrong with the
xHCI driver after the endpoint rings are expanded:

> xhci_hcd 0000:02:00.0: ERROR no room on ep ring, try ring expansion
> xhci_hcd 0000:02:00.0: ring expansion succeed, now has 4 segments
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> ...
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> xhci_hcd 0000:02:00.0: overrun event on endpoint
> xhci_hcd 0000:02:00.0: Overrun Event for slot 2 ep 2 still with TDs queued?
> xhci_hcd 0000:02:00.0: All tds on the ep_ring skipped. Clear skip flag.
> xhci_hcd 0000:02:00.0: ERROR Unknown event condition, HC probably busted
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> ...

I'm concerned that the ring dequeue pointer somehow got out of sync with
the hardware dequeue pointer here.

> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> xhci_hcd 0000:02:00.0: overrun event on endpoint
> xhci_hcd 0000:02:00.0: Overrun Event for slot 2 ep 2 still with TDs queued?
> xhci_hcd 0000:02:00.0: All tds on the ep_ring skipped. Clear skip flag.
> xhci_hcd 0000:02:00.0: ERROR Unknown event condition, HC probably busted

That error will only print when there's a vendor-specific error code, or
some other completion code that the xHCI driver doesn't understand.

> xhci_hcd 0000:02:00.0: ep 0x87 - asked for 16 bytes, 7 bytes untransferred
> usb 3-2: Driver uses different interval (8 microframes) than xHCI (128
> microframes)
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> xhci_hcd 0000:02:00.0: Found td. Clear skip flag.
> xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD

This error code may be due to a short packet (we have a known issue with
the Linux xHCI driver not properly handing extra events for short
packets on xHCI 1.0 hosts), or it may indicate the ring expansion was
causing issues, or it may indicate the device was just suspended.

> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> xhci_hcd 0000:02:00.0: ERROR Unknown event condition, HC probably busted
> xhci_hcd 0000:02:00.0: Found td. Clear skip flag.
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> ...
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> xhci_hcd 0000:02:00.0: overrun event on endpoint
> xhci_hcd 0000:02:00.0: Overrun Event for slot 2 ep 2 still with TDs queued?
> xhci_hcd 0000:02:00.0: All tds on the ep_ring skipped. Clear skip flag.
> xhci_hcd 0000:02:00.0: ERROR Unknown event condition, HC probably busted
> xhci_hcd 0000:02:00.0: ep 0x87 - asked for 16 bytes, 9 bytes untransferred
> usb 3-2: Driver uses different interval (8 microframes) than xHCI (128
> microframes)
> ...
> xhci_hcd 0000:02:00.0: URB transfer length is wrong, xHC issue? req.
> len = 97920, act. len = 4294965524

I wonder if you're running into the event transfer length issue that was
fixed in recent 3.9 kernels:

commit 1c11a172cb30492f5f6a82c6e118fdcd9946c34f
Author: Vivek Gautam <gautam.vivek@xxxxxxxxxxx>
Date:   Thu Mar 21 12:06:48 2013 +0530

    usb: xhci: Fix TRB transfer length macro used for Event TRB.
    
    Use proper macro while extracting TRB transfer length from
    Transfer event TRBs. Adding a macro EVENT_TRB_LEN (bits 0:23)
    for the same, and use it instead of TRB_LEN (bits 0:16) in
    case of event TRBs.
    
    This patch should be backported to kernels as old as 2.6.31, that
    contain the commit b10de142119a676552df3f0d2e3a9d647036c26a "USB: xhci:
    Bulk transfer support".  This patch will have issues applying to older
    kernels.
    
    Signed-off-by: Vivek gautam <gautam.vivek@xxxxxxxxxxx>
    Signed-off-by: Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx>
    Cc: stable@xxxxxxxxxxxxxxx

Either that, or the xHCI host is giving back completely bogus
untransferred lengths.

> stk1160: allocating urbs...
> stk1160: 16 urbs allocated
> xhci_hcd 0000:02:00.0: ERROR no room on ep ring, try ring expansion
> xhci_hcd 0000:02:00.0: ring expansion succeed, now has 4 segments
> xhci_hcd 0000:02:00.0: overrun event on endpoint
> xhci_hcd 0000:02:00.0: Overrun Event for slot 2 ep 4 still with TDs queued?
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> ...
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> xhci_hcd 0000:02:00.0: Found td. Clear skip flag.
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> xhci_hcd 0000:02:00.0: Found td. Clear skip flag.
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> ...
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> xhci_hcd 0000:02:00.0: overrun event on endpoint
> xhci_hcd 0000:02:00.0: Overrun Event for slot 2 ep 4 still with TDs queued?
> print_err_status: 458 callbacks suppressed
> URB packet 50, status -18 [Unknown].
> URB packet 51, status -18 [Unknown].
> URB packet 52, status -18 [Unknown].
> URB packet 53, status -18 [Unknown].
> URB packet 54, status -18 [Unknown].
> URB packet 55, status -18 [Unknown].
> URB packet 56, status -18 [Unknown].
> URB packet 57, status -18 [Unknown].
> URB packet 58, status -18 [Unknown].
> URB packet 59, status -18 [Unknown].

-18 is EXDEV.  That's due to the xHCI host controller skipping several
isochronous frames.  I *think* that's a valid URB frame status, so the
stk1160 driver needs to handle it like a short packet.  If it's not a
valid URB frame status, the xHCI driver needs to use a different URB
error code.

> xhci_hcd 0000:02:00.0: All tds on the ep_ring skipped. Clear skip flag.
> xhci_hcd 0000:02:00.0: ERROR Unknown event condition, HC probably busted
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> xhci_hcd 0000:02:00.0: Found td. Clear skip flag.
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> xhci_hcd 0000:02:00.0: Found td. Clear skip flag.
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> ...
> xhci_hcd 0000:02:00.0: Miss service interval error, set skip flag
> xhci_hcd 0000:02:00.0: All tds on the ep_ring skipped. Clear skip flag.
> xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD
> xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD
> xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD
> xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD
> xhci_hcd 0000:02:00.0: ERROR Unknown event condition, HC probably busted
> xhci_hcd 0000:02:00.0: ERROR Unknown event condition, HC probably busted
> xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD
> xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD
> xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD

Yeah, those messages make me lean more towards a problem in the xHCI
ring handling.

> xhci_hcd 0000:02:00.0: Cancel URB ee642e40, dev 2, ep 0x0, starting at
> offset 0x2e66f1c0
> xhci_hcd 0000:02:00.0: // Ding dong!
> xhci_hcd 0000:02:00.0: xHCI host not responding to stop endpoint command.

Ugh, not good.  The host stopped responding to commands, which should
have nothing to do with the ring expansion code.  Unless the xHCI host
dereferenced a bad ring segment pointer or transfer buffer and went off
into la-la-land...

> xhci_hcd 0000:02:00.0: Assuming host is dying, halting host.
> xhci_hcd 0000:02:00.0: // Halt the HC
> xhci_hcd 0000:02:00.0: Killing URBs for slot ID 1, ep index 0
> xhci_hcd 0000:02:00.0: Killing URBs for slot ID 1, ep index 2
> xhci_hcd 0000:02:00.0: Killing URBs for slot ID 1, ep index 3
> xhci_hcd 0000:02:00.0: Killing URBs for slot ID 2, ep index 0
> xhci_hcd 0000:02:00.0: Killing URBs for slot ID 2, ep index 2
> xhci_hcd 0000:02:00.0: Killing URBs for slot ID 2, ep index 4
> xhci_hcd 0000:02:00.0: Killing URBs for slot ID 2, ep index 8
> xhci_hcd 0000:02:00.0: Calling usb_hc_died()
> xhci_hcd 0000:02:00.0: HC died; cleaning up
> xhci_hcd 0000:02:00.0: xHCI host controller is dead.

And then the driver kills the host, because it isn't responding to
commands.

Ok, I've attached two patches.  Please apply the first patch on top of
the latest 3.9-rc kernel, with CONFIG_USB_DEBUG and
CONFIG_USB_XHCI_HCD_DEBUGGING and attempt to trigger these ERROR
messages.  Then apply the second patch, and see if the messages go away.

Sarah Sharp
>From c244b65b85d867bc17a910b8348b05efd3fa647a Mon Sep 17 00:00:00 2001
From: Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx>
Date: Mon, 8 Apr 2013 10:12:08 -0700
Subject: [PATCH 1/2] xhci: Add more debugging for ring expansion failures.

Signed-off-by: Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx>
---
 drivers/usb/host/xhci-ring.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 1969c00..1c5c4d0 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2455,6 +2455,14 @@ static int handle_tx_event(struct xhci_hcd *xhci,
 		}
 		xhci_warn(xhci, "ERROR Unknown event condition, HC probably "
 				"busted\n");
+		xhci_dbg(xhci, "Event TRB with TRB type ID %u\n",
+				(le32_to_cpu(event->flags) &
+				 TRB_TYPE_BITMASK)>>10);
+		xhci_print_trb_offsets(xhci, (union xhci_trb *) event);
+		xhci_dbg(xhci, "Event ring:\n");
+		xhci_debug_ring(xhci, xhci->event_ring);
+		xhci_dbg(xhci, "Endpoint ring:\n");
+		xhci_debug_ring(xhci, ep_ring);
 		goto cleanup;
 	}
 
@@ -2534,6 +2542,14 @@ static int handle_tx_event(struct xhci_hcd *xhci,
 				xhci_err(xhci,
 					"ERROR Transfer event TRB DMA ptr not "
 					"part of current TD\n");
+				xhci_dbg(xhci, "Event TRB with TRB type ID %u\n",
+						(le32_to_cpu(event->flags) &
+						 TRB_TYPE_BITMASK)>>10);
+				xhci_print_trb_offsets(xhci, (union xhci_trb *) event);
+				xhci_dbg(xhci, "Event ring:\n");
+				xhci_debug_ring(xhci, xhci->event_ring);
+				xhci_dbg(xhci, "Endpoint ring:\n");
+				xhci_debug_ring(xhci, ep_ring);
 				return -ESHUTDOWN;
 			}
 
-- 
1.7.9

>From 0d2c8a867699315278409d46772e6d54c7af4137 Mon Sep 17 00:00:00 2001
From: Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx>
Date: Mon, 8 Apr 2013 10:12:29 -0700
Subject: [PATCH 2/2] xhci: Statically allocate ten ring segments.

Should be enough ring segments?  Add more if you see the message:

ERROR no room on ep ring, try ring expansion

Signed-off-by: Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx>
---
 drivers/usb/host/xhci-mem.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
index 965b539..be2b773 100644
--- a/drivers/usb/host/xhci-mem.c
+++ b/drivers/usb/host/xhci-mem.c
@@ -1391,7 +1391,7 @@ int xhci_endpoint_init(struct xhci_hcd *xhci,
 	type = usb_endpoint_type(&ep->desc);
 	/* Set up the endpoint ring */
 	virt_dev->eps[ep_index].new_ring =
-		xhci_ring_alloc(xhci, 2, 1, type, mem_flags);
+		xhci_ring_alloc(xhci, 10, 1, type, mem_flags);
 	if (!virt_dev->eps[ep_index].new_ring) {
 		/* Attempt to use the ring cache */
 		if (virt_dev->num_rings_cached == 0)
-- 
1.7.9