I've come across what appears to be another xHCI issue - attempting to format a disk with gparted is causing a kernel Oops. This may not be related to the issue you're currently investigating, but wanted to pass it on in case it is (if it isn't let me know and I'll either keep quiet or raise it separately, whatever you prefer). I can easily reproduce the crash running 3.19rc6 with Mathias additional error and debugging messages (debugging switched off). I wasn't able to reproduce the issue with xhci debugging enabled, i.e.: echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control Just enabling debugging for xhci-ring.c, i.e.: echo -n 'file drivers/usb/host/xhci-ring.c =p' > /sys/kernel/debug/dynamic_debug/control allowed me to catch the crash with a bit more info. I've attached the full log at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1412121/+attachment/4306068/+files/syslog-3.19rc6-m0.1-gparted.log The end of the log is: Jan 26 22:33:27 alistair-XPS13 kernel: [ 908.402917] xhci_hcd 0000:00:14.0: Stalled endpoint Jan 26 22:33:27 alistair-XPS13 kernel: [ 908.402933] xhci_hcd 0000:00:14.0: cmdring ctrl reg before ringing 0x8 Jan 26 22:33:27 alistair-XPS13 kernel: [ 908.402937] xhci_hcd 0000:00:14.0: // Ding dong! Jan 26 22:33:27 alistair-XPS13 kernel: [ 908.402943] xhci_hcd 0000:00:14.0: cmdring ctrl reg after ringing 0x8 Jan 26 22:33:27 alistair-XPS13 kernel: [ 908.402949] xhci_hcd 0000:00:14.0: Giveback URB ffff88020554dcc0, len = 0, expected = 512, status = -32 Jan 26 22:33:28 alistair-XPS13 kernel: [ 909.450041] xhci_hcd 0000:00:14.0: Stalled endpoint Jan 26 22:33:28 alistair-XPS13 kernel: [ 909.450049] xhci_hcd 0000:00:14.0: cmdring ctrl reg before ringing 0x8 Jan 26 22:33:28 alistair-XPS13 kernel: [ 909.450050] xhci_hcd 0000:00:14.0: // Ding dong! Jan 26 22:33:28 alistair-XPS13 kernel: [ 909.450053] xhci_hcd 0000:00:14.0: cmdring ctrl reg after ringing 0x8 Jan 26 22:33:28 alistair-XPS13 kernel: [ 909.450055] xhci_hcd 0000:00:14.0: Giveback URB ffff88020554d3c0, len = 0, expected = 512, status = -32 Jan 26 22:33:28 alistair-XPS13 kernel: [ 909.450256] xhci_hcd 0000:00:14.0: Stalled endpoint Jan 26 22:33:28 alistair-XPS13 kernel: [ 909.450258] xhci_hcd 0000:00:14.0: cmdring ctrl reg before ringing 0x8 Jan 26 22:33:28 alistair-XPS13 kernel: [ 909.450259] xhci_hcd 0000:00:14.0: // Ding dong! Jan 26 22:33:28 alistair-XPS13 kernel: [ 909.450261] xhci_hcd 0000:00:14.0: cmdring ctrl reg after ringing 0x8 Jan 26 22:33:28 alistair-XPS13 kernel: [ 909.450263] xhci_hcd 0000:00:14.0: Giveback URB ffff880203f6f000, len = 0, expected = 13, status = -32 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.174704] xhci_hcd 0000:00:14.0: Starting stop cmd watchdog timer for slot 6 ep index 2. Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.174713] xhci_hcd 0000:00:14.0: cmdring ctrl reg before ringing 0x8 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.174716] xhci_hcd 0000:00:14.0: // Ding dong! Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.174721] xhci_hcd 0000:00:14.0: cmdring ctrl reg after ringing 0x8 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.174728] xhci_hcd 0000:00:14.0: Stopped on Transfer TRB Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.174738] xhci_hcd 0000:00:14.0: cmdring ctrl reg before ringing 0x8 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.174739] xhci_hcd 0000:00:14.0: // Ding dong! Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.174743] xhci_hcd 0000:00:14.0: cmdring ctrl reg after ringing 0x8 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.229892] xhci_hcd 0000:00:14.0: Port Status Change Event for port 2 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.229901] xhci_hcd 0000:00:14.0: handle_port_status: starting port polling. Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.286659] xhci_hcd 0000:00:14.0: cmdring ctrl reg before ringing 0x8 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.286666] xhci_hcd 0000:00:14.0: // Ding dong! Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.286672] xhci_hcd 0000:00:14.0: cmdring ctrl reg after ringing 0x8 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.286817] xhci_hcd 0000:00:14.0: Completed reset device command. Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.286871] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.286955] IP: [<ffffffff815effed>] xhci_discover_or_reset_device+0x33d/0x520 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.287030] PGD 0 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.287054] Oops: 0000 [#1] SMP Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.287090] Modules linked in: nls_iso8859_1 uas usb_storage ctr ccm pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables arc4 iwlmvm mac80211 dm_crypt rpcsec_gss_krb5 nfsv4 intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm iwlwifi uvcvideo videobuf2_vmalloc videobuf2_memops snd_hda_codec_realtek snd_hda_codec_hdmi videobuf2_core crct10dif_pclmul nfsd crc32_pclmul v4l2_common snd_hda_codec_generic ghash_clmulni_intel snd_soc_rt5640 videodev cfg80211 snd_soc_rl6231 snd_soc_core snd_hda_intel snd_hda_controller snd_hda_codec aesni_intel auth_rpcgss joydev hid_rmi aes_x86_64 snd_compress snd_hwdep snd_pcm_dmaengine dell_laptop lrw nfs_acl dell_wmi media sparse_keymap snd_pcm nfs gf128mul dcdbas glue_helper snd_seq_midi ablk_helper snd_seq_midi_event cryptd snd_rawmidi serio_raw snd_seq btusb rfcomm hid_multitouch bnep snd_seq_device lockd snd_timer mei_me bluetooth mei snd i2c_hid dw_dmac grace shpchp sunrpc dw_dmac_core lpc_ich soundcore 8250_dw i2c_designware_platform snd_soc_sst_acpi spi_pxa2xx_platform i2c_designware_core mac_hid intel_smartconnect intel_rst fscache parport_pc ppdev coretemp lp parport hid_generic i915 i2c_algo_bit drm_kms_helper usbhid ahci psmouse hid libahci sdhci_acpi drm wmi video sdhci Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.287884] CPU: 1 PID: 5682 Comm: usb-storage Tainted: G OE 3.19.0-rc6akg-m0.1+ #4 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.287932] Hardware name: Dell Inc. XPS13 9333/099TN4, BIOS A06 11/07/2014 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.287971] task: ffff8802066e62c0 ti: ffff8802067b4000 task.ti: ffff8802067b4000 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288012] RIP: 0010:[<ffffffff815effed>] [<ffffffff815effed>] xhci_discover_or_reset_device+0x33d/0x520 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288069] RSP: 0018:ffff8802067b7af8 EFLAGS: 00010246 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288099] RAX: 0000000000000000 RBX: ffff8802137aa000 RCX: 0000000000000000 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288137] RDX: 0000000000000040 RSI: ffff8802066e62c0 RDI: ffff8802055eedc8 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288176] RBP: ffff8802067b7b48 R08: ffff8802067b4000 R09: 0000000000000006 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288214] R10: 0000000000000003 R11: 0000000000000020 R12: ffff88021558a000 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288253] R13: 0000000000000002 R14: 0000000000000001 R15: ffff8802137aa1bc Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288292] FS: 0000000000000000(0000) GS:ffff88021f240000(0000) knlGS:0000000000000000 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288336] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288368] CR2: 0000000000000040 CR3: 0000000001c13000 CR4: 00000000001407e0 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288407] Stack: Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288419] ffff8800b3e07800 0000000100000000 ffff880206721140 ffff8802137abbfc Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288463] 0000000000000000 ffff8800ad788c00 0000000000000000 0000000000000000 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288507] 0000000000000002 0000000000000032 ffff8802067b7bb8 ffffffff815a0dd9 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288552] Call Trace: Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288573] [<ffffffff815a0dd9>] hub_port_reset+0x379/0x5e0 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288607] [<ffffffff815a10b5>] hub_port_init+0x75/0xca0 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288642] [<ffffffff810a0b70>] ? wake_up_state+0x20/0x20 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288675] [<ffffffff815ab735>] ? usb_hcd_reset_endpoint+0x25/0x70 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288712] [<ffffffff815aeb45>] ? usb_enable_endpoint+0x85/0x90 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288747] [<ffffffff815a1dfd>] usb_reset_and_verify_device+0x11d/0x760 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288787] [<ffffffff815a25bb>] usb_reset_device+0x17b/0x2c0 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288825] [<ffffffffc0ac9132>] usb_stor_port_reset+0x62/0x70 [usb_storage] Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288867] [<ffffffffc0ac91d4>] usb_stor_invoke_transport+0x94/0x610 [usb_storage] Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288914] [<ffffffff817b4dd4>] ? wait_for_completion_interruptible+0xc4/0x2d0 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.288957] [<ffffffffc0ac7d7e>] usb_stor_transparent_scsi_command+0xe/0x10 [usb_storage] Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.289005] [<ffffffffc0acada9>] usb_stor_control_thread+0x209/0x3a0 [usb_storage] Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.289050] [<ffffffffc0acaba0>] ? usb_stor_disconnect+0x130/0x130 [usb_storage] Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.289094] [<ffffffff8109337b>] kthread+0xdb/0x100 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.289124] [<ffffffff810932a0>] ? kthread_create_on_node+0x180/0x180 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.289163] [<ffffffff817b877c>] ret_from_fork+0x7c/0xb0 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.289195] [<ffffffff810932a0>] ? kthread_create_on_node+0x180/0x180 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.289231] Code: ff ff 49 39 87 6c ff ff ff 74 15 49 8b 04 24 48 c7 c6 c8 65 b2 81 48 8b 38 31 c0 e8 de f1 ef ff 49 8b 87 4c ff ff ff 48 8d 50 40 <48> 39 50 40 74 1c 49 8b 04 24 48 c7 c6 f8 65 b2 81 48 8b 38 31 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.289400] RIP [<ffffffff815effed>] xhci_discover_or_reset_device+0x33d/0x520 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.289443] RSP <ffff8802067b7af8> Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.289463] CR2: 0000000000000040 Jan 26 22:33:59 alistair-XPS13 kernel: [ 940.301795] ---[ end trace 6ee84852c606375e ]--- Hope this helps, Alistair On Mon, Jan 26, 2015 at 12:58 PM, Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> wrote: > On 26.01.2015 05:37, Devin Heitmueller wrote: >> Hi Mathias, >> >> Here's an interesting development: as a result of a related thread on >> linux-media, I came across a patch they are distributing in openelec: >> >> https://github.com/OpenELEC/OpenELEC.tv/commit/b636927dec20652ff020e54ed7838a2e9be51e03 >> >> Now I'm not saying that reverting the commit in question is the >> "right" thing to do, but I applied this patch and for the first time >> in 100+ tests it started to work (i.e. I'm not seeing the XHCI hcd >> tear down all the attached devices). >> >> Given what I've seen of the bug I cannot really explain why the >> scatter gather list sizes would have any bearing on TRBs for USB >> control messages to be added to the queue. Perhaps we're hitting the >> upper bound of the list? Any further speculation on my part would >> just make me look clueless... >> >> It would be great if you could offer any insight as to why the patch >> in question could be responsible for the behavior we're seeing. I >> would really rather not just blindly check this patch into my local >> tree and declare "victory" without understanding the underlying issue >> and whether it's likely to cause other problems. >> > > Hi > > Thanks for digging this info out. > > I'm starting to think maybe there could be something wrong in the ring expansion, > or ring memory management in general. > > The failed control endpoint stopping was preceeded by a ring expansion, (in all the logs I got) > If the ring expansion already kills xhci it would explain why we never see the control message on the bus. > > The patch you show increases the TRB_PER_SEGMENT size from 64 to 256, this again reduces the need for ring expansion > as each segment is already four times bigger initially. > > Another thing pointing to ring memory management is the additional checks I added printing: > > "Cancelled TD not on stopped ring" > "Cancel URB NOT on current ring" > > Hitting these means that the the urb we want to cancel points to a segment that is no longer part of that endpoints ring. > But we still alter the memory of that segment. > We cache, free, and re-use segments as endpoints are dropped and added, or rings expanded. > We might have randomly altered a segment that just got used by another ring after a ring expansion. > > Atleast thats a new theory. > I need to do more hacking > > -Mathias > > -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html