On Thu, 31 Mar 2011, Don Zickus wrote: > I am working with a customer who has a pair of systems with a single CDROM > that switches between the systems during a failover. They ran a stress test > and after a few hours and thousands of failovers, they stumbled upon this > BUG() in drivers/usb/echi-mem.c > > static void qh_destroy(struct ehci_qh *qh) > { > struct ehci_hcd *ehci = qh->ehci; > > /* clean qtds first, and know this is not linked */ > if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) { > ^^^^^^^^^^^^^^^ > ehci_dbg (ehci, "unused qh not empty!\n"); > BUG (); > } > > Their analysis is as follows: > > ------------------ > We believe we've tracked down the root cause of this problem. It > relates to hot-unplug (surprise disappearance) of an ehci_hcd device. > > The problem begins at the top of routine scan_async() in ehci-q.c: > > 1242 ehci->stamp = ehci_readl(ehci, &ehci->regs->frame_index); > > If the ehci has gone away, this ehci_readl() will return ~0U (all > ones), per the PCI spec. > > If there are completions to process, the conditional > > 1249 if (!list_empty (&qh->qtd_list) > 1250 && qh->stamp != ehci->stamp) > > will be true, so qh->stamp will get the value ~0U, and control will > come back around to the 'rescan:' label. > > This time around, presumably, the qtd list is empty, so we'll skip > the first conditional, and begin executing the second one: > > 1275 if (list_empty(&qh->qtd_list) > 1276 && qh->qh_state == QH_STATE_LINKED) { > 1277 if (!ehci->reclaim > 1278 && ((ehci->stamp - qh->stamp) & 0x1fff) > 1279 >= (EHCI_SHRINK_FRAMES * 8)) > 1280 start_unlink_async(ehci, qh); > > Assuming that the other conditions are right, the subtraction of the > two timestamps will evaluate to 0, so the comparison will fail, and > start_unlink_async() will not get called. > > Even if this code is executed again later, the two stamps will always > have this same value, so start_unlink_async() will never get invoked on > this qh. > > In our case, the qh that's not being unlinked happens to be ehci->async->next. > So, when we later invoke pci removal on the ehci HCD device, we trip > over this, from ehci-mem.c, while trying to destroy ehci->async > > 67 static void qh_destroy(struct ehci_qh *qh) > 68 { > 69 struct ehci_hcd *ehci = qh->ehci; > 70 > 71 /* clean qtds first, and know this is not linked */ > 72 if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) { > 73 ehci_dbg (ehci, "unused qh not empty!\n"); > 74 BUG (); > 75 } > > Since we still have this "undead" qh hanging off of qh->qh_next, > we take the bugcheck here. The analysis sounds good. That is indeed a possible failure mode. > ------------- > > Their solution was to add a check in the code path > > 1275 if (list_empty(&qh->qtd_list) > 1276 && qh->qh_state == QH_STATE_LINKED) { > 1277 if (!ehci->reclaim > 1278 && ((ehci->stamp - qh->stamp) & 0x1fff) > 1279 >= (EHCI_SHRINK_FRAMES * 8)) > 1280 start_unlink_async(ehci, qh); > > to also check for ehci->stamp == ~0U. It seemed to solve their problem. > > We aren't sure if this is the correct way to solve the problem and are > looking for guidance from people more knowledgable. > > Thoughts? It's not the best solution. We want to prevent this failure mode from _ever_ occurring, not just from occurring when the controller has been removed. > Cheers, > Don > > -- > Note: this problem was discovered, analyzed, and solved on RHEL-6 which is > closely related to 2.6.32 with regards to the usb2 stack. The customer > doesn't have the time to use upstream kernels. The changes in this area > since 2.6.32 seemed minimal enough, that I believe the problem is still > relevant. See what they think of the patch below. Note: Totally untested. I haven't tried running it myself. Alan Stern Index: usb-2.6/drivers/usb/host/ehci-q.c =================================================================== --- usb-2.6.orig/drivers/usb/host/ehci-q.c +++ usb-2.6/drivers/usb/host/ehci-q.c @@ -1257,24 +1257,27 @@ static void start_unlink_async (struct e static void scan_async (struct ehci_hcd *ehci) { + bool stopped; struct ehci_qh *qh; enum ehci_timer_action action = TIMER_IO_WATCHDOG; ehci->stamp = ehci_readl(ehci, &ehci->regs->frame_index); timer_action_done (ehci, TIMER_ASYNC_SHRINK); rescan: + stopped = !HC_IS_RUNNING(ehci_to_hcd(ehci)->state); qh = ehci->async->qh_next.qh; if (likely (qh != NULL)) { do { /* clean any finished work for this qh */ - if (!list_empty (&qh->qtd_list) - && qh->stamp != ehci->stamp) { + if (!list_empty(&qh->qtd_list) && (stopped || + qh->stamp != ehci->stamp)) { int temp; /* unlinks could happen here; completion * reporting drops the lock. rescan using * the latest schedule, but don't rescan - * qhs we already finished (no looping). + * qhs we already finished (no looping) + * unless the controller is stopped. */ qh = qh_get (qh); qh->stamp = ehci->stamp; @@ -1295,9 +1298,9 @@ rescan: */ if (list_empty(&qh->qtd_list) && qh->qh_state == QH_STATE_LINKED) { - if (!ehci->reclaim - && ((ehci->stamp - qh->stamp) & 0x1fff) - >= (EHCI_SHRINK_FRAMES * 8)) + if (!ehci->reclaim && (stopped || + ((ehci->stamp - qh->stamp) & 0x1fff) + >= EHCI_SHRINK_FRAMES * 8)) start_unlink_async(ehci, qh); else action = TIMER_ASYNC_SHRINK; -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html