I am working with a customer who has a pair of systems with a single CDROM that switches between the systems during a failover. They ran a stress test and after a few hours and thousands of failovers, they stumbled upon this BUG() in drivers/usb/echi-mem.c static void qh_destroy(struct ehci_qh *qh) { struct ehci_hcd *ehci = qh->ehci; /* clean qtds first, and know this is not linked */ if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) { ^^^^^^^^^^^^^^^ ehci_dbg (ehci, "unused qh not empty!\n"); BUG (); } Their analysis is as follows: ------------------ We believe we've tracked down the root cause of this problem. It relates to hot-unplug (surprise disappearance) of an ehci_hcd device. The problem begins at the top of routine scan_async() in ehci-q.c: 1242 ehci->stamp = ehci_readl(ehci, &ehci->regs->frame_index); If the ehci has gone away, this ehci_readl() will return ~0U (all ones), per the PCI spec. If there are completions to process, the conditional 1249 if (!list_empty (&qh->qtd_list) 1250 && qh->stamp != ehci->stamp) will be true, so qh->stamp will get the value ~0U, and control will come back around to the 'rescan:' label. This time around, presumably, the qtd list is empty, so we'll skip the first conditional, and begin executing the second one: 1275 if (list_empty(&qh->qtd_list) 1276 && qh->qh_state == QH_STATE_LINKED) { 1277 if (!ehci->reclaim 1278 && ((ehci->stamp - qh->stamp) & 0x1fff) 1279 >= (EHCI_SHRINK_FRAMES * 8)) 1280 start_unlink_async(ehci, qh); Assuming that the other conditions are right, the subtraction of the two timestamps will evaluate to 0, so the comparison will fail, and start_unlink_async() will not get called. Even if this code is executed again later, the two stamps will always have this same value, so start_unlink_async() will never get invoked on this qh. In our case, the qh that's not being unlinked happens to be ehci->async->next. So, when we later invoke pci removal on the ehci HCD device, we trip over this, from ehci-mem.c, while trying to destroy ehci->async 67 static void qh_destroy(struct ehci_qh *qh) 68 { 69 struct ehci_hcd *ehci = qh->ehci; 70 71 /* clean qtds first, and know this is not linked */ 72 if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) { 73 ehci_dbg (ehci, "unused qh not empty!\n"); 74 BUG (); 75 } Since we still have this "undead" qh hanging off of qh->qh_next, we take the bugcheck here. ------------- Their solution was to add a check in the code path 1275 if (list_empty(&qh->qtd_list) 1276 && qh->qh_state == QH_STATE_LINKED) { 1277 if (!ehci->reclaim 1278 && ((ehci->stamp - qh->stamp) & 0x1fff) 1279 >= (EHCI_SHRINK_FRAMES * 8)) 1280 start_unlink_async(ehci, qh); to also check for ehci->stamp == ~0U. It seemed to solve their problem. We aren't sure if this is the correct way to solve the problem and are looking for guidance from people more knowledgable. Thoughts? Cheers, Don -- Note: this problem was discovered, analyzed, and solved on RHEL-6 which is closely related to 2.6.32 with regards to the usb2 stack. The customer doesn't have the time to use upstream kernels. The changes in this area since 2.6.32 seemed minimal enough, that I believe the problem is still relevant. --- drivers/usb/host/ehci-q.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c index 98ded66..5cb649e 100644 --- a/drivers/usb/host/ehci-q.c +++ b/drivers/usb/host/ehci-q.c @@ -1287,7 +1287,8 @@ rescan: && qh->qh_state == QH_STATE_LINKED) { if (!ehci->reclaim && ((ehci->stamp - qh->stamp) & 0x1fff) - >= (EHCI_SHRINK_FRAMES * 8)) + >= (EHCI_SHRINK_FRAMES * 8) || + (ehci->stamp == ~0U)) start_unlink_async(ehci, qh); else action = TIMER_ASYNC_SHRINK; -- 1.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html