Re: Hitting "unused qh not empty" BUG in qh_destroy

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, 16 Sep 2014 10:44:45 -0400
Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:

> On Tue, 16 Sep 2014, Joe Lawrence wrote:
> 
> > > You can check for this.  Sprinkle ehci_info messages throughout 
> > > ehci_stop, printing the value of ehci->async->qh_next.qh.  It should be 
> > > NULL the entire time.
> > 
> > I re-ran the tests with the following changes:
> > 
> >   - moved num_async debug up above struct ehci_hcd.priv[0] :)
> >   - added verbose debug msg in check_async_ring to display the new
> >     num_async value
> >   - added ehci_info to every other line in ehci_stop
> > 
> > % grep '0000:66:00.1' vmcore-dmesg.txt
> > 
> > [ 4936.956739] pci 0000:66:00.1: [8086:1d2d] type 00 class 0x0c0320
> > [ 4936.956789] pci 0000:66:00.1: reg 0x10: [mem 0xa0000400-0xa00007ff]
> > [ 4936.957019] pci 0000:66:00.1: PME# supported from D0 D3hot D3cold
> > [ 4939.564933] ehci-pci 0000:66:00.1: EHCI Host Controller
> > [ 4939.565462] ehci-pci 0000:66:00.1: new USB bus registered, assigned bus number 4
> > [ 4939.565498] ehci-pci 0000:66:00.1: debug port 2
> > [ 4939.569466] ehci-pci 0000:66:00.1: cache line size of 64 is not supported
> > [ 4939.569533] ehci-pci 0000:66:00.1: irq 13, io mem 0xa0000400
> > [ 4939.574794] ehci-pci 0000:66:00.1: USB 2.0 started, EHCI 1.00
> > [ 4939.575410] usb usb4: SerialNumber: 0000:66:00.1
> > [ 4940.089044] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4941.795700] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh=          (null)
> > [ 4941.895713] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4941.943761] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh=          (null)
> > [ 4942.002715] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4942.064346] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh=          (null)
> > [ 4942.156646] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4942.226028] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh=          (null)
> > [ 4942.274185] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4942.329196] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh=          (null)
> > [ 4942.391055] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4942.629653] ehci-pci 0000:66:00.1: HC died; cleaning up
> > [ 4945.663153] ehci-pci 0000:66:00.1: remove, state 1
> > [ 4945.733934] ehci-pci 0000:66:00.1: ehci_stop:420 ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.774242] ehci-pci 0000:66:00.1: ehci_stop:427 ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.814791] ehci-pci 0000:66:00.1: ehci_stop:429 ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.855938] ehci-pci 0000:66:00.1: ehci_stop:431 ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.897145] ehci-pci 0000:66:00.1: ehci_stop:433 ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.938791] ehci-pci 0000:66:00.1: ehci_stop:436 ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.980491] ehci-pci 0000:66:00.1: ehci_stop:438 ehci(ffff8808419839d0)->num_async = 1
> > [ 4946.022221] ehci-pci 0000:66:00.1: ehci_stop:440 ehci(ffff8808419839d0)->num_async = 1
> > [ 4946.063936] ehci-pci 0000:66:00.1: ehci_stop:444 ehci(ffff8808419839d0)->num_async = 1
> > [ 4946.105631] ehci-pci 0000:66:00.1: ehci_stop:446 ehci(ffff8808419839d0)->num_async = 1
> > [ 4946.241802] ehci-pci 0000:66:00.1: ehci_stop:448 ehci(ffff8808419839d0)->num_async = 1
> > 
> > where ehci_stop line 420 is function entry and line 448 function
> > return.  Notice that check_async_ring incremented num_async just before
> > the HC died/removed and then num_async was 1 throughout ehci_stop.
> > 
> > qh_destroy then BUGs with a struct ehci_hcd *ehci of ffff8808419839d0.
> 
> This says that ehci->num_async ended up equal to 1.  But in your 
> previous email, you said the final printout indicated it was equal to 
> 0.  Did something change in the meantime?

I just double checked the logs, and that is correct.  However, the
earlier reports of a dozen or so crashes included num_async at the end
of struct ehci_hcd.  Perhaps the priv[] array clobbered the value?
Sorry for the confusion.

> Anyway, the log above means that a QH was linked before the HC died,
> but then it was never unlinked.  Please add a line at the start of
> ehci_endpoint_disable, printing out the value of ep->hcpriv for
> comparison with the values listed above.

Will do.

> Also, let's see the contents of *qh and *(qh->hw) for the left-over QH
> at *(ehci->async->qh_next.qh).  You can get those right away from your 
> crash dump data, right?

Yep, interestingly enough, *(ehci->async->qh_next.qh) appears to be
mostly full of zeros:

------------[ cut here ]------------
kernel BUG at drivers/usb/host/ehci-mem.c:72!
invalid opcode: 0000 [#1] SMP 
CPU: 16 PID: 17572 Comm: kworker/u66:1 Tainted: PF       W  O--------------   3.10.0-123.6.3.el7.bigphysarea_expedient1_usb3.x86_64 #1
Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
Workqueue: ftworker io_poll [ftmod]
task: ffff881051320000 ti: ffff8810440aa000 task.ti: ffff8810440aa000
RIP: 0010:[<ffffffff81412136>]  [<ffffffff81412136>] qh_destroy+0x66/0x90
RSP: 0018:ffff8810440ab870  EFLAGS: 00010246
RAX: ffff88085202b588 RBX: ffff88085202b570 RCX: 0000000000000000
RDX: ffff88107fc10040 RSI: ffff88085202b570 RDI: ffff8808419839d0
RBP: ffff8810440ab880 R08: 0000000000000082 R09: 00000000000061c8
R10: 000000000000000f R11: 0000000007070707 R12: ffff8808419839d0
R13: ffff880841983abc R14: ffff88084e425700 R15: 0000000000000048
FS:  0000000000000000(0000) GS:ffff88107fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7727dfe010 CR3: 000000104198c000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff8808419839d0 ffff8808419839d0 ffff8810440ab8c0 ffffffff81413388
 0000000000000000 0000000000000046 ffff8808419837b0 ffff8808419839d0
 ffff880841983abc ffff88084e425700 ffff8810440ab948 ffffffff81417d0f
Call Trace:
 [<ffffffff81413388>] ehci_mem_cleanup+0x28/0x180
 [<ffffffff81417d0f>] ehci_stop+0x29f/0x410
 [<ffffffff8106e35b>] ? lock_timer_base.isra.35+0x2b/0x50
 [<ffffffff8106f04e>] ? try_to_del_timer_sync+0x5e/0x90
 [<ffffffff813f8d38>] usb_remove_hcd+0x108/0x1d0
 [<ffffffff8140a15f>] usb_hcd_pci_remove+0x12f/0x150
 [<ffffffff812ec5bb>] pci_device_remove+0x3b/0xb0
 [<ffffffff813b6b9f>] __device_release_driver+0x7f/0xf0
 [<ffffffff813b6c33>] device_release_driver+0x23/0x30
 [<ffffffff813b63a8>] bus_remove_device+0x108/0x180
 [<ffffffff813b2955>] device_del+0x135/0x1d0
 [<ffffffff812e54b4>] pci_stop_bus_device+0x94/0xa0
 [<ffffffff812e545b>] pci_stop_bus_device+0x3b/0xa0
 [<ffffffff812e545b>] pci_stop_bus_device+0x3b/0xa0
 [<ffffffff812e55a2>] pci_stop_and_remove_bus_device+0x12/0x20
 [<ffffffffa02ea917>] ioboard_bringdown+0xa7/0xe0 [ftmod]
 [<ffffffffa02eac07>] ftmod_ioboard_event+0xd7/0x400 [ftmod]
 [<ffffffffa0538a00>] ? OS_READ_PORT_UINT16+0x20/0x20 [ccmod]
 [<ffffffffa02e87b2>] io_state_change+0x92/0x130 [ftmod]

Quick double-check that assembly still left registers in expected places:

drivers/usb/host/ehci-mem.c: 68
  <qh_destroy+0xa>:    mov    %rsp,%rbp
  <qh_destroy+0xd>:    push   %r12
  <qh_destroy+0xf>:    mov    %rdi,%r12
...
drivers/usb/host/ehci-mem.c: 68
  <qh_destroy+0x17>:   mov    %rsi,%rbx

so:

  R12: struct ehci_hcd ffff8808419839d0          << ehci
  RBX: struct ehci_qh  ffff88085202b570          << qh

crash> struct ehci_hcd ffff8808419839d0 | grep async 
  ...
  async = 0xffff88085202b570,                    << ehci->async
  num_async = 0x1,
  ...

crash> struct ehci_qh 0xffff88085202b570
struct ehci_qh {
  hw = 0xffff88007ed0b000, 
  qh_dma = 0x7ed0b000, 
  qh_next = {
    qh = 0xffff88083d6b4558,                     << ehci->async->qh_next.qh
    itd = 0xffff88083d6b4558, 
    sitd = 0xffff88083d6b4558, 
    fstn = 0xffff88083d6b4558, 
    hw_next = 0xffff88083d6b4558, 
    ptr = 0xffff88083d6b4558
  }, 
  qtd_list = {
    next = 0xffff88085202b588, 
    prev = 0xffff88085202b588
  }, 
  intr_node = {
    next = 0x0, 
    prev = 0x0
  }, 
  dummy = 0xffff88007895f000, 
  unlink_node = {
    next = 0xffff88085202b5b0, 
    prev = 0xffff88085202b5b0
  }, 
  unlink_cycle = 0x0, 
  qh_state = 0x1, 
  xacterrs = 0x0, 
  usecs = 0x0, 
  gap_uf = 0x0, 
  c_usecs = 0x0, 
  tt_usecs = 0x0, 
  period = 0x0, 
  start = 0x0, 
  dev = 0x0, 
  is_out = 0x0, 
  clearing_tt = 0x0, 
  dequeue_during_giveback = 0x0, 
  exception = 0x0
}

crash> struct ehci_qh 0xffff88083d6b4558         << *(ehci->async->qh_next.qh)
struct ehci_qh {
  hw = 0x0,
  qh_dma = 0x0, 
  qh_next = {
    qh = 0x0, 
    itd = 0x0, 
    sitd = 0x0, 
    fstn = 0x0, 
    hw_next = 0x0, 
    ptr = 0x0
  }, 
  qtd_list = {
    next = 0x0, 
    prev = 0x0
  }, 
  intr_node = {
    next = 0x0, 
    prev = 0x0
  }, 
  dummy = 0x0, 
  unlink_node = {
    next = 0x0, 
    prev = 0x0
  }, 
  unlink_cycle = 0x6, 
  qh_state = 0x0, 
  xacterrs = 0x0, 
  usecs = 0x0, 
  gap_uf = 0x0, 
  c_usecs = 0x0, 
  tt_usecs = 0x0, 
  period = 0x0, 
  start = 0x0, 
  dev = 0x6, 
  is_out = 0x0, 
  clearing_tt = 0x0, 
  dequeue_during_giveback = 0x0, 
  exception = 0x0
}

Regards,

-- Joe
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux