On Mon, 15 Sep 2014 11:59:15 -0400 Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote: > On Sat, 13 Sep 2014, Joe Lawrence wrote: > > > Hi Alan, > > > > I've collected 16 crashes since kicking off automated tests a little > > over 24 hrs ago. > > > > Each crash hit the BUG in qh_destroy() and the only unique debugging > > printk from ehci_stop() was: "ehci_stop: ehci->num_async = 0". > > What about error messages from the check_async_ring routine? In the crash instances, the error messages never occurred. (I had added a BUG() in that code block, just to be really clear if/when that event occurred). > You can check for this. Sprinkle ehci_info messages throughout > ehci_stop, printing the value of ehci->async->qh_next.qh. It should be > NULL the entire time. I re-ran the tests with the following changes: - moved num_async debug up above struct ehci_hcd.priv[0] :) - added verbose debug msg in check_async_ring to display the new num_async value - added ehci_info to every other line in ehci_stop % grep '0000:66:00.1' vmcore-dmesg.txt [ 4936.956739] pci 0000:66:00.1: [8086:1d2d] type 00 class 0x0c0320 [ 4936.956789] pci 0000:66:00.1: reg 0x10: [mem 0xa0000400-0xa00007ff] [ 4936.957019] pci 0000:66:00.1: PME# supported from D0 D3hot D3cold [ 4939.564933] ehci-pci 0000:66:00.1: EHCI Host Controller [ 4939.565462] ehci-pci 0000:66:00.1: new USB bus registered, assigned bus number 4 [ 4939.565498] ehci-pci 0000:66:00.1: debug port 2 [ 4939.569466] ehci-pci 0000:66:00.1: cache line size of 64 is not supported [ 4939.569533] ehci-pci 0000:66:00.1: irq 13, io mem 0xa0000400 [ 4939.574794] ehci-pci 0000:66:00.1: USB 2.0 started, EHCI 1.00 [ 4939.575410] usb usb4: SerialNumber: 0000:66:00.1 [ 4940.089044] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 [ 4941.795700] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh= (null) [ 4941.895713] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 [ 4941.943761] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh= (null) [ 4942.002715] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 [ 4942.064346] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh= (null) [ 4942.156646] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 [ 4942.226028] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh= (null) [ 4942.274185] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 [ 4942.329196] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 0, qh= (null) [ 4942.391055] ehci-pci 0000:66:00.1: check_async_ring:970 ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558 [ 4942.629653] ehci-pci 0000:66:00.1: HC died; cleaning up [ 4945.663153] ehci-pci 0000:66:00.1: remove, state 1 [ 4945.733934] ehci-pci 0000:66:00.1: ehci_stop:420 ehci(ffff8808419839d0)->num_async = 1 [ 4945.774242] ehci-pci 0000:66:00.1: ehci_stop:427 ehci(ffff8808419839d0)->num_async = 1 [ 4945.814791] ehci-pci 0000:66:00.1: ehci_stop:429 ehci(ffff8808419839d0)->num_async = 1 [ 4945.855938] ehci-pci 0000:66:00.1: ehci_stop:431 ehci(ffff8808419839d0)->num_async = 1 [ 4945.897145] ehci-pci 0000:66:00.1: ehci_stop:433 ehci(ffff8808419839d0)->num_async = 1 [ 4945.938791] ehci-pci 0000:66:00.1: ehci_stop:436 ehci(ffff8808419839d0)->num_async = 1 [ 4945.980491] ehci-pci 0000:66:00.1: ehci_stop:438 ehci(ffff8808419839d0)->num_async = 1 [ 4946.022221] ehci-pci 0000:66:00.1: ehci_stop:440 ehci(ffff8808419839d0)->num_async = 1 [ 4946.063936] ehci-pci 0000:66:00.1: ehci_stop:444 ehci(ffff8808419839d0)->num_async = 1 [ 4946.105631] ehci-pci 0000:66:00.1: ehci_stop:446 ehci(ffff8808419839d0)->num_async = 1 [ 4946.241802] ehci-pci 0000:66:00.1: ehci_stop:448 ehci(ffff8808419839d0)->num_async = 1 where ehci_stop line 420 is function entry and line 448 function return. Notice that check_async_ring incremented num_async just before the HC died/removed and then num_async was 1 throughout ehci_stop. qh_destroy then BUGs with a struct ehci_hcd *ehci of ffff8808419839d0. -- Joe -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html