On Sat, 13 Sep 2014, Joe Lawrence wrote: > Hi Alan, > > I've collected 16 crashes since kicking off automated tests a little > over 24 hrs ago. > > Each crash hit the BUG in qh_destroy() and the only unique debugging > printk from ehci_stop() was: "ehci_stop: ehci->num_async = 0". What about error messages from the check_async_ring routine? > I can include (or upload) a full (or filtered) vmcore-dmesg.txt if that > would be more helpful. Only if it includes one of those messages. > The debug code I added as you suggested is provided below... > > Thanks, > > -- Joe > > > diff -Nupr before/drivers/usb/host/ehci.h after/drivers/usb/host/ehci.h > --- before/drivers/usb/host/ehci.h 2014-07-16 14:25:31.000000000 -0400 > +++ after/drivers/usb/host/ehci.h 2014-09-12 15:47:18.943478397 -0400 > @@ -231,6 +231,8 @@ struct ehci_hcd { /* one per controlle > > /* platform-specific data -- must come last */ > unsigned long priv[0] __aligned(sizeof(s64)); > + > + int num_async; > }; Not a good idea to put your new field _after_ a field that is documented as coming last, but in this case it probably won't hurt. > /* convert between an HCD pointer and the corresponding EHCI_HCD */ > diff -Nupr before/drivers/usb/host/ehci-hcd.c after/drivers/usb/host/ehci-hcd.c > --- before/drivers/usb/host/ehci-hcd.c 2014-07-16 14:25:31.000000000 -0400 > +++ after/drivers/usb/host/ehci-hcd.c 2014-09-12 15:53:16.863163627 -0400 > @@ -440,6 +440,8 @@ static void ehci_stop (struct usb_hcd *h > if (ehci->amd_pll_fix == 1) > usb_amd_dev_put(); > > + pr_err("%s: ehci->num_async = %d\n", __func__, ehci->num_async); > + > dbg_status (ehci, "ehci_stop completed", > ehci_readl(ehci, &ehci->regs->status)); > } Always 0, as it should be. > diff -Nupr before/drivers/usb/host/ehci-q.c after/drivers/usb/host/ehci-q.c > --- before/drivers/usb/host/ehci-q.c 2014-07-16 14:25:31.000000000 -0400 > +++ after/drivers/usb/host/ehci-q.c 2014-09-12 15:52:08.023292291 -0400 > @@ -959,6 +959,24 @@ static void disable_async(struct ehci_hc > ehci_poll_ASS(ehci); > } > > +static void check_async_ring(struct ehci_hcd *ehci, int add) > +{ > + struct ehci_qh *qh; > + int n; > + > + qh = ehci->async->qh_next.qh; > + n = ehci->num_async += add; > + while (qh && n > 0) { > + qh = qh->qh_next.qh; > + --n; > + } > + if (qh || n != 0) { > + ehci_err(ehci, "EHCI async list corrupted: num %d n %d qh %p\n", > + ehci->num_async, n, qh); > + BUG(); > + } > +} In the last call to this function, ehci->num_async must get set to 0 (because that is the final value as printed out in ehci_stop). This means n = 0, which means ehci->async->qh_next.qh must be NULL. And this routine is the only place where ehci->num_async gets changed. But then ehci->async->qh_next.qh isn't NULL when the final qh_destroy call is made. This suggests that something changes it in the meantime. You can check for this. Sprinkle ehci_info messages throughout ehci_stop, printing the value of ehci->async->qh_next.qh. It should be NULL the entire time. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html