On Tue, 16 Sep 2014 10:44:45 -0400 Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote: > On Tue, 16 Sep 2014, Joe Lawrence wrote: > > > > You can check for this. Sprinkle ehci_info messages throughout > > > ehci_stop, printing the value of ehci->async->qh_next.qh. It should be > > > NULL the entire time. > > > > I re-ran the tests with the following changes: > > > > - moved num_async debug up above struct ehci_hcd.priv[0] :) > > - added verbose debug msg in check_async_ring to display the new > > num_async value > > - added ehci_info to every other line in ehci_stop > > > > % grep '0000:66:00.1' vmcore-dmesg.txt > > > > [ 4936.956739] pci 0000:66:00.1: [8086:1d2d] type 00 class 0x0c0320 > > [ 4936.956789] pci 0000:66:00.1: reg 0x10: [mem 0xa0000400-0xa00007ff] > > [ 4936.957019] pci 0000:66:00.1: PME# supported from D0 D3hot D3cold > > [ 4939.564933] ehci-pci 0000:66:00.1: EHCI Host Controller > > [ 4939.565462] ehci-pci 0000:66:00.1: new USB bus registered, assigned bus number 4 > > [ 4939.565498] ehci-pci 0000:66:00.1: debug port 2 > > [ 4939.569466] ehci-pci 0000:66:00.1: cache line size of 64 is not supported > > [ 4939.569533] ehci-pci 0000:66:00.1: irq 13, io mem 0xa0000400 > > [ 4939.574794] ehci-pci 0000:66:00.1: USB 2.0 started, EHCI 1.00 > > [ 4939.575410] usb usb4: SerialNumber: 0000:66:00.1 > > [ 4940.089044] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 > > [ 4941.795700] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh= (null) > > [ 4941.895713] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 > > [ 4941.943761] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh= (null) > > [ 4942.002715] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 > > [ 4942.064346] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh= (null) > > [ 4942.156646] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 > > [ 4942.226028] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh= (null) > > [ 4942.274185] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 > > [ 4942.329196] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh= (null) > > [ 4942.391055] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 > > [ 4942.629653] ehci-pci 0000:66:00.1: HC died; cleaning up > > [ 4945.663153] ehci-pci 0000:66:00.1: remove, state 1 > > [ 4945.733934] ehci-pci 0000:66:00.1: ehci_stop:420 ehci(ffff8808419839d0)->num_async = 1 > > [ 4945.774242] ehci-pci 0000:66:00.1: ehci_stop:427 ehci(ffff8808419839d0)->num_async = 1 > > [ 4945.814791] ehci-pci 0000:66:00.1: ehci_stop:429 ehci(ffff8808419839d0)->num_async = 1 > > [ 4945.855938] ehci-pci 0000:66:00.1: ehci_stop:431 ehci(ffff8808419839d0)->num_async = 1 > > [ 4945.897145] ehci-pci 0000:66:00.1: ehci_stop:433 ehci(ffff8808419839d0)->num_async = 1 > > [ 4945.938791] ehci-pci 0000:66:00.1: ehci_stop:436 ehci(ffff8808419839d0)->num_async = 1 > > [ 4945.980491] ehci-pci 0000:66:00.1: ehci_stop:438 ehci(ffff8808419839d0)->num_async = 1 > > [ 4946.022221] ehci-pci 0000:66:00.1: ehci_stop:440 ehci(ffff8808419839d0)->num_async = 1 > > [ 4946.063936] ehci-pci 0000:66:00.1: ehci_stop:444 ehci(ffff8808419839d0)->num_async = 1 > > [ 4946.105631] ehci-pci 0000:66:00.1: ehci_stop:446 ehci(ffff8808419839d0)->num_async = 1 > > [ 4946.241802] ehci-pci 0000:66:00.1: ehci_stop:448 ehci(ffff8808419839d0)->num_async = 1 > > > > where ehci_stop line 420 is function entry and line 448 function > > return. Notice that check_async_ring incremented num_async just before > > the HC died/removed and then num_async was 1 throughout ehci_stop. > > > > qh_destroy then BUGs with a struct ehci_hcd *ehci of ffff8808419839d0. > > This says that ehci->num_async ended up equal to 1. But in your > previous email, you said the final printout indicated it was equal to > 0. Did something change in the meantime? I just double checked the logs, and that is correct. However, the earlier reports of a dozen or so crashes included num_async at the end of struct ehci_hcd. Perhaps the priv[] array clobbered the value? Sorry for the confusion. > Anyway, the log above means that a QH was linked before the HC died, > but then it was never unlinked. Please add a line at the start of > ehci_endpoint_disable, printing out the value of ep->hcpriv for > comparison with the values listed above. Will do. > Also, let's see the contents of *qh and *(qh->hw) for the left-over QH > at *(ehci->async->qh_next.qh). You can get those right away from your > crash dump data, right? Yep, interestingly enough, *(ehci->async->qh_next.qh) appears to be mostly full of zeros: ------------[ cut here ]------------ kernel BUG at drivers/usb/host/ehci-mem.c:72! invalid opcode: 0000 [#1] SMP CPU: 16 PID: 17572 Comm: kworker/u66:1 Tainted: PF W O-------------- 3.10.0-123.6.3.el7.bigphysarea_expedient1_usb3.x86_64 #1 Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013 Workqueue: ftworker io_poll [ftmod] task: ffff881051320000 ti: ffff8810440aa000 task.ti: ffff8810440aa000 RIP: 0010:[<ffffffff81412136>] [<ffffffff81412136>] qh_destroy+0x66/0x90 RSP: 0018:ffff8810440ab870 EFLAGS: 00010246 RAX: ffff88085202b588 RBX: ffff88085202b570 RCX: 0000000000000000 RDX: ffff88107fc10040 RSI: ffff88085202b570 RDI: ffff8808419839d0 RBP: ffff8810440ab880 R08: 0000000000000082 R09: 00000000000061c8 R10: 000000000000000f R11: 0000000007070707 R12: ffff8808419839d0 R13: ffff880841983abc R14: ffff88084e425700 R15: 0000000000000048 FS: 0000000000000000(0000) GS:ffff88107fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7727dfe010 CR3: 000000104198c000 CR4: 00000000000407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8808419839d0 ffff8808419839d0 ffff8810440ab8c0 ffffffff81413388 0000000000000000 0000000000000046 ffff8808419837b0 ffff8808419839d0 ffff880841983abc ffff88084e425700 ffff8810440ab948 ffffffff81417d0f Call Trace: [<ffffffff81413388>] ehci_mem_cleanup+0x28/0x180 [<ffffffff81417d0f>] ehci_stop+0x29f/0x410 [<ffffffff8106e35b>] ? lock_timer_base.isra.35+0x2b/0x50 [<ffffffff8106f04e>] ? try_to_del_timer_sync+0x5e/0x90 [<ffffffff813f8d38>] usb_remove_hcd+0x108/0x1d0 [<ffffffff8140a15f>] usb_hcd_pci_remove+0x12f/0x150 [<ffffffff812ec5bb>] pci_device_remove+0x3b/0xb0 [<ffffffff813b6b9f>] __device_release_driver+0x7f/0xf0 [<ffffffff813b6c33>] device_release_driver+0x23/0x30 [<ffffffff813b63a8>] bus_remove_device+0x108/0x180 [<ffffffff813b2955>] device_del+0x135/0x1d0 [<ffffffff812e54b4>] pci_stop_bus_device+0x94/0xa0 [<ffffffff812e545b>] pci_stop_bus_device+0x3b/0xa0 [<ffffffff812e545b>] pci_stop_bus_device+0x3b/0xa0 [<ffffffff812e55a2>] pci_stop_and_remove_bus_device+0x12/0x20 [<ffffffffa02ea917>] ioboard_bringdown+0xa7/0xe0 [ftmod] [<ffffffffa02eac07>] ftmod_ioboard_event+0xd7/0x400 [ftmod] [<ffffffffa0538a00>] ? OS_READ_PORT_UINT16+0x20/0x20 [ccmod] [<ffffffffa02e87b2>] io_state_change+0x92/0x130 [ftmod] Quick double-check that assembly still left registers in expected places: drivers/usb/host/ehci-mem.c: 68 <qh_destroy+0xa>: mov %rsp,%rbp <qh_destroy+0xd>: push %r12 <qh_destroy+0xf>: mov %rdi,%r12 ... drivers/usb/host/ehci-mem.c: 68 <qh_destroy+0x17>: mov %rsi,%rbx so: R12: struct ehci_hcd ffff8808419839d0 << ehci RBX: struct ehci_qh ffff88085202b570 << qh crash> struct ehci_hcd ffff8808419839d0 | grep async ... async = 0xffff88085202b570, << ehci->async num_async = 0x1, ... crash> struct ehci_qh 0xffff88085202b570 struct ehci_qh { hw = 0xffff88007ed0b000, qh_dma = 0x7ed0b000, qh_next = { qh = 0xffff88083d6b4558, << ehci->async->qh_next.qh itd = 0xffff88083d6b4558, sitd = 0xffff88083d6b4558, fstn = 0xffff88083d6b4558, hw_next = 0xffff88083d6b4558, ptr = 0xffff88083d6b4558 }, qtd_list = { next = 0xffff88085202b588, prev = 0xffff88085202b588 }, intr_node = { next = 0x0, prev = 0x0 }, dummy = 0xffff88007895f000, unlink_node = { next = 0xffff88085202b5b0, prev = 0xffff88085202b5b0 }, unlink_cycle = 0x0, qh_state = 0x1, xacterrs = 0x0, usecs = 0x0, gap_uf = 0x0, c_usecs = 0x0, tt_usecs = 0x0, period = 0x0, start = 0x0, dev = 0x0, is_out = 0x0, clearing_tt = 0x0, dequeue_during_giveback = 0x0, exception = 0x0 } crash> struct ehci_qh 0xffff88083d6b4558 << *(ehci->async->qh_next.qh) struct ehci_qh { hw = 0x0, qh_dma = 0x0, qh_next = { qh = 0x0, itd = 0x0, sitd = 0x0, fstn = 0x0, hw_next = 0x0, ptr = 0x0 }, qtd_list = { next = 0x0, prev = 0x0 }, intr_node = { next = 0x0, prev = 0x0 }, dummy = 0x0, unlink_node = { next = 0x0, prev = 0x0 }, unlink_cycle = 0x6, qh_state = 0x0, xacterrs = 0x0, usecs = 0x0, gap_uf = 0x0, c_usecs = 0x0, tt_usecs = 0x0, period = 0x0, start = 0x0, dev = 0x6, is_out = 0x0, clearing_tt = 0x0, dequeue_during_giveback = 0x0, exception = 0x0 } Regards, -- Joe -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html