On Fri, 12 Sep 2014 11:31:46 -0400
Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:

> On Thu, 11 Sep 2014, Joe Lawrence wrote:
>
> > Hi Alan,
> >
> > I've got another USB bug to report that manifests during automated
> > device removal testing on RHEL7. This one hits the BUG() inside
> > qh_destroy:
>
> How reliably can you trigger this bug?

I have collected a few crashes within a few days, so somewhat
frequently.

> > static void qh_destroy(struct ehci_hcd *ehci, struct ehci_qh *qh)
> > {
> >         /* clean qtds first, and know this is not linked */
> >         if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) {
> >                 ehci_dbg (ehci, "unused qh not empty!\n");
> >                 BUG ();
> >         }
>
> > and finally a dump of the ehci_qh in question:
> >
> > crash> struct ehci_qh ffff88084b84dc80
> > struct ehci_qh {
> >   hw = 0xffff880078d1a000,
>
> It would be good to see the contents of the ehci_qh_hw structure. That
> would tell us what device and endpoint this QH was for.

crash> struct ehci_qh_hw 0xffff880078d1a000
struct ehci_qh_hw {
  hw_next = 0x78d1a062,
  hw_info1 = 0x8000,
  hw_info2 = 0x0,
  hw_current = 0x0,
  hw_qtd_next = 0x1,
  hw_alt_next = 0x78d22000,
  hw_token = 0x40,
  hw_buf = {0x0, 0x0, 0x0, 0x0, 0x0},
  hw_buf_hi = {0x0, 0x0, 0x0, 0x0, 0x0}
}

> >   qh_dma = 0x78d1a000,
> >   qh_next = {
> >     qh = 0xffff88084efe6730,
> >     itd = 0xffff88084efe6730,
> >     sitd = 0xffff88084efe6730,
> >     fstn = 0xffff88084efe6730,
> >     hw_next = 0xffff88084efe6730,
> >     ptr = 0xffff88084efe6730            << !NULL
> >   },
> >   qtd_list = {                          << list_empty
> >     next = 0xffff88084b84dc98,
> >     prev = 0xffff88084b84dc98
> >   },
> >   intr_node = {
> >     next = 0x0,
> >     prev = 0x0
> >   },
> >   dummy = 0xffff880078d22000,
> >   unlink_node = {
> >     next = 0xffff88084b84dcc0,
> >     prev = 0xffff88084b84dcc0
> >   },
> >   unlink_cycle = 0x0,
> >   qh_state = 0x1,                       << QH_STATE_LINKED
> ...
> > }
> >
> > The qtd_list is empty, contains only one entry, itself.
> >
> > crash> struct -o ehci_qh | grep td_list
> >   [0x18] struct list_head qtd_list;
> > crash> p/x 0xffff88084b84dc80 + 0x18
> > $1 = 0xffff88084b84dc98
> >
> > but qh->qh_next.ptr is !NULL, so we hit the BUG. However, it seems that
> > the memory at qh->qh_next.ptr has been freed:
>
> > I'm not too familiar with the USB code stack, so any suggestions on
> > instrumentation that I can add to aid in debugging would be helpful.
> > Maybe some tracing in qh_link_async / single_unlink_async /
> > end_unlink_async /qh_link_periodic can reveal the sequence that is
> > leaving this dangling qh_next.ptr?
>
> The place to look is ehci_endpoint_disable. Did that routine get
> called for this QH? Did it hit the default case of the big switch
> statement (with its ehci_err statement)?

Not sure if there is enough residual side-effect data in a crash dump to
determine if ehci_endpoint_disable executed. However, the QH that
qh_destroy was handling did *not* have the exception bit set. (See the
first mail for the structure dump.)

Would it be reasonable to add printk debugging messages to
ehci_endpoint_disable to trace the QH in question and its qh_state?

> > Note: This does bear some resemblance to a bug that Stratus hit a few
> > years ago [1] [2], however enough of the code has changed that I'm not
> > sure the fix for that one would apply to a modern kernel.
>
> What version of the driver are you currently running?

The driver is built into a slightly modified RHEL7
3.10.0-123.6.3.el7.x86_64 kernel.

Regards,

-- Joe
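
The ehci_qh_hw dump above can be read field by field. What follows is a
minimal standalone sketch, not driver code, that decodes hw_info1 and
hw_token. It assumes the bit layout from the EHCI specification's queue
head definition (bit 15 of the first dword is the H flag, which Linux
names QH_HEAD; bit 6 of the token is Halted, QTD_STS_HALT). The struct
and function names are hypothetical, chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decoded view of the first "endpoint characteristics" dword of an EHCI
 * queue head (hw_info1 in the Linux driver). Field names are illustrative
 * and are not taken from the kernel sources.
 */
struct qh_info1 {
	unsigned dev_addr;   /* bits 6:0   device address */
	bool     inactivate; /* bit 7      inactivate on next transaction */
	unsigned endpoint;   /* bits 11:8  endpoint number */
	unsigned speed;      /* bits 13:12 0 = full, 1 = low, 2 = high */
	bool     dtc;        /* bit 14     data toggle control */
	bool     head;       /* bit 15     H: head of reclamation list */
	unsigned max_packet; /* bits 26:16 maximum packet length */
	bool     control_ep; /* bit 27     control endpoint flag */
	unsigned nak_reload; /* bits 31:28 NAK count reload */
};

/* Split a CPU-order hw_info1 value into its fields. */
struct qh_info1 decode_qh_info1(uint32_t v)
{
	struct qh_info1 d;

	d.dev_addr   = v & 0x7f;
	d.inactivate = (v >> 7) & 1;
	d.endpoint   = (v >> 8) & 0xf;
	d.speed      = (v >> 12) & 0x3;
	d.dtc        = (v >> 14) & 1;
	d.head       = (v >> 15) & 1;
	d.max_packet = (v >> 16) & 0x7ff;
	d.control_ep = (v >> 27) & 1;
	d.nak_reload = (v >> 28) & 0xf;
	return d;
}

/* Status bits of the overlay token (hw_token). */
bool qtd_token_halted(uint32_t token)
{
	return (token >> 6) & 1; /* bit 6: Halted */
}

bool qtd_token_active(uint32_t token)
{
	return (token >> 7) & 1; /* bit 7: Active */
}
```

With the values from the dump, decode_qh_info1(0x8000) gives device
address 0, endpoint 0 and only the H flag set, and hw_token = 0x40 has
Halted set and Active clear. That pattern would be consistent with the
dummy head of the async schedule rather than a per-endpoint QH, though
the dump alone does not prove it.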