Hi Alan, I think I see what's going on. Permit me to comment on your explanation of urb->use_count first, since it's relevant later on. > Here's the story: > > A new URB has its use_count set to 0 by usb_init_urb. > > The use_count is incremented when the URB is submitted, > in usb_hcd_submit_urb. (If the submission fails then > the use_count is decremented again.) > > The use_count is decremented after the completion routine > returns, in usb_hcd_giveback_urb. > > This means that an active URB should never have its use_count equal to > 0. The value is normally 1; it can go up to 2 temporarily if the > completion routine resubmits the URB. There's an additional state here that you're not mentioning - a URB which has been submitted, serviced, but then *not* rescheduled in the completion handler. In this case the use_count will drop back down to zero. Now in many drivers the completion handler doesn't resubmit the URB because of some error condition (such as the urb->status being ENOENT because some thread of execution called usb_kill_urb()), or in other drivers the completion handler will blindly call usb_submit_urb() regardless of urb->status but the call will fail (again for example because somebody called usb_kill_urb() and thus urb->reject was set to 1). However, there's a third condition which can cause the completion handler to not resubmit the URB, and that's because the driver has a flag which simply says not to (the au0828 driver does this to stop streaming). Hence you can have a case where the URB completion handler was called with (urb->status == 0) but the URB will not still be resubmitted, and thus when usb_kill_urb() is eventually called against the URB, the use_count is going to be zero when entering the function. Hence the actual lifecycle is: create: 0 submit: 1 complete: - if completion handler calls usb_submit_urb() then 2 during usb_submit_urb(), then back to 1 after completion returns. - if completion handler doesn't call usb_submit_urb(), then 0 after completion returns. Why does this matter? Asynchronous unlinking. But let's first look at how usb_kill_urb() works 99.99% of the time, which is against a URB which has been submitted and is actively being serviced: 1. A task calls usb_kill_urb() against a URB with use count of 1 (or sometimes 2). 2. usb_kill_urb() increments urb->reject so any attempt to resubmit the URB fails 3. usb_kill_urb() calls usb_hcd_unlink_urb() *** 4. usb_kill_urb() spins on &usb_kill_urb_queue waiting for use_count to reach zero. 5. usb_kill_urb() returns to the caller, which typically then calls usb_free_urb(urb). Between steps 3 and 4, the following things occur: 3a. usb_hcd_unlink_urb() sees urb->use_count is nonzero and calls unlink1() against the URB 3b. unlink1() calls ehci_urb_dequeue(), which calls usb_hcd_check_unlink_urb, which sets the urb->unlinked field to -ENOENT 3c. ehci_urb_dequeue() calls start_unlink_async() for the QH tied to the URB Sometime while usb_kill_urb() is spinning on urb->use_count, the IRQ handler eventually fires: 3d. ehci_irq() calls end_unlink_async() 3e. end_unlink_async() calls qh_completions() 3f. qh_completions() calls ehci_urb_done() 3g. ehci_urb_done() calls __usb_hcd_giveback_urb 3h. __usb_hcd_giveback_urb calls the completion handler for the URB 3i. Because the URB isn't resubmitted, use_count drops to zero and wake_up(&usb_kill_urb_queue) is called. === For the failure case, I see the completion handler firing for the URB, but the URB isn't being resubmitted because of the flag in the completion handler. Then when the kill comes around, the urb->use_count is zero and thus it returns immediately rather than blocking on the wait queue. But this doesn't change the fact that the end_unlink_async() hasn't necessarily been invoked yet for the URB. When the end_unlink_async() does finally do its housekeeping and call ehci_urb_done(), it's being called against a URB that has already been freed. The problem is exacerbated by the fact that I immediately call usb_free_urb() right after the usb_kill_urb() call for a given URB, and hence create a larger window of opportunity to hit the use-after-free condition. Even a short sleep would likely allow for enough time after the usb_kill_urb() call to allow end_unlink_async() to finish its work before the underlying URB was freed. Below is a trace which shows the crash, and how during the completion handler the use count is good for URB ebc54410, but because the URB won't be resubmitted it drops to zero by the time the usb_kill_urb() is called (note, URB ebc54410 is the second to last URB killed before the crash). You can also see at the time the completion handler is called for URB ebc54410 that while I've set the flag that the completion handler looks for to return without resubmitting, I haven't actually issued the usb_kill_urb() call, which is why the status is 0 instead of -2. One other data point worth mentioning: The current au0828 driver code does the following to shutdown the stream: for (i = 0; i < URB_COUNT; i++) { usb_kill_urb(dev->urbs[i]) kfree(dev->urbs[i]->transfer_buffer) usb_free(dev->urbs[i]) } If I change the code as follows, I completely avoid the race condition: for (i = 0; i < URB_COUNT; i++) { usb_kill_urb(dev->urbs[i]) } for (i = 0; i < URB_COUNT; i++) { kfree(dev->urbs[i]->transfer_buffer) usb_free(dev->urbs[i]) } By killing all the URBs before freeing any of them, I avoid the possibility of end_unlink_async() being invoked against a free'd URB. This is obviously a lousy hack, but it was good enough to roll out since I've been unable to reproduce the issue after four hours of continuous testing (whereas before the patch I could cause a crash in a manner of minutes). There's clearly a reference counting problem if the IRQ handler is still referring to qh->urb entries in its list, but the underlying URB is no longer in a submitted state. That said, I don't really know enough about the internals of the EHCI controller model to know what the right fix is. [ 3958.875980] au0828/0: start_urb_transfer() [ 3958.877332] au0828/0: start_urb_transfer()done ret=0 [ 3963.088348] au0828/0: au0828_dvb_stop_feed() [ 3963.088366] au0828/0: au0828_dvb_stop_feed(), start_count: 2646, stop_count: 2646 [ 3963.088373] au0828/0: stop_urb_transfer() [ 3963.088382] au0828/0: killing urb ebc540d0 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.090978] au0828/0: urb_completion() urb=ebc540d0 use_count=1 status=-2 [ 3963.091030] au0828/0: urb_completion() urb=ebc54410 use_count=1 status=0 [ 3963.091113] au0828/0: killing urb ebc54dd0 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.093815] au0828/0: urb_completion() urb=ebc54dd0 use_count=1 status=-2 [ 3963.093880] au0828/0: killing urb ebc548f0 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.096985] au0828/0: urb_completion() urb=ebc548f0 use_count=1 status=-2 [ 3963.097103] au0828/0: killing urb ebc54680 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.099803] au0828/0: urb_completion() urb=ebc54680 use_count=1 status=-2 [ 3963.099849] au0828/0: killing urb ebc54750 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.102817] au0828/0: urb_completion() urb=ebc54750 use_count=1 status=-2 [ 3963.102918] au0828/0: killing urb ebc54000 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.105803] au0828/0: urb_completion() urb=ebc54000 use_count=1 status=-2 [ 3963.105851] au0828/0: killing urb ebc544e0 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.108824] au0828/0: urb_completion() urb=ebc544e0 use_count=1 status=-2 [ 3963.108932] au0828/0: killing urb ebc54c30 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.111806] au0828/0: urb_completion() urb=ebc54c30 use_count=1 status=-2 [ 3963.111853] au0828/0: killing urb ebc54270 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.114821] au0828/0: urb_completion() urb=ebc54270 use_count=1 status=-2 [ 3963.114921] au0828/0: killing urb ebc54820 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.117806] au0828/0: urb_completion() urb=ebc54820 use_count=1 status=-2 [ 3963.117854] au0828/0: killing urb ebc54a90 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.120828] au0828/0: urb_completion() urb=ebc54a90 use_count=1 status=-2 [ 3963.120931] au0828/0: killing urb ebc545b0 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.123809] au0828/0: urb_completion() urb=ebc545b0 use_count=1 status=-2 [ 3963.123855] au0828/0: killing urb ebc54340 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.126824] au0828/0: urb_completion() urb=ebc54340 use_count=1 status=-2 [ 3963.126923] au0828/0: killing urb ebc54b60 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.129810] au0828/0: urb_completion() urb=ebc54b60 use_count=1 status=-2 [ 3963.129857] au0828/0: killing urb ebc54410 dev=f69ab4d0 ep=f73ee500 uc=0 [ 3963.129871] au0828/0: killing urb ebc549c0 dev=f69ab4d0 ep=f73ee500 uc=1 [ 3963.132860] BUG: unable to handle kernel paging request at 6b6b6b6f [ 3963.133060] IP: [<c149bc1e>] usb_hcd_unlink_urb_from_ep+0x1e/0x40 [ 3963.133230] *pdpt = 000000002e8c2001 *pde = 0000000000000000 [ 3963.133384] Oops: 0002 [#1] SMP [ 3963.133480] Modules linked in: cuse nfsd(F) auth_rpcgss(F) nfs_acl(F) nfs(F) lockd(F) sunrpc(F) fscache(F) snd_usb_audio snd_usbmidi_lib au8522_dig(OF) snd_hda_codec_realtek snd_hda_codec_hdmi btusb joydev(F) xc5000(OF) tuner(OF) bluetooth au8522_decoder(OF) au8522_common(OF) hid_generic kvm_amd(F) kvm(F) snd_hda_intel microcode(F) snd_hda_codec snd_hwdep(F) radeon snd_pcm(F) arc4(F) iwldvm snd_page_alloc(F) mac80211 k10temp snd_seq_midi(F) snd_seq_midi_event(F) au0828(OF) snd_rawmidi(F) tveeprom(OF) videobuf_vmalloc(OF) videobuf_core(OF) v4l2_common(OF) ttm iwlwifi dvb_core(OF) usbhid videodev(OF) media(OF) hid snd_seq(F) drm_kms_helper snd_seq_device(F) snd_timer(F) drm sp5100_tco ohci_pci snd(F) i2c_piix4 cfg80211 i2c_algo_bit soundcore(F) mac_hid lp(F) parport(F) r8169 mii(F) ahci(F) libahci(F) [ 3963.135611] [ 3963.135629] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GF O 3.12.0-031200rc7-generic #201310271935 [ 3963.135851] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Inagua CRB, BIOS 4.6.5 02/12/2014 [ 3963.136081] task: f70ce800 ti: f7108000 task.ti: f70c6000 [ 3963.136215] EIP: 0060:[<c149bc1e>] EFLAGS: 00210046 CPU: 1 [ 3963.136352] EIP is at usb_hcd_unlink_urb_from_ep+0x1e/0x40 [ 3963.136487] EAX: ebc54424 EBX: ebc54410 ECX: 6b6b6b6b EDX: 6b6b6b6b [ 3963.136638] ESI: ebc54410 EDI: f6849860 EBP: f7109e50 ESP: f7109e4c [ 3963.136789] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 [ 3963.136921] CR0: 8005003b CR2: 6b6b6b6f CR3: 2e821000 CR4: 000007f0 [ 3963.137071] Stack: [ 3963.137124] f6849a64 f7109e68 c14ae94a ffffff8d f6872ea0 ebcb3edc ebc26930 f7109ecc [ 3963.137364] c14af863 36872d80 00000000 00200082 000f4240 00000007 f68499e0 f7109ea0 [ 3963.137601] c1075a26 000f4240 03010000 ebcb3e40 f6872ea0 ebc26940 00000001 f68499a8 [ 3963.137839] Call Trace: [ 3963.137913] [<c14ae94a>] ehci_urb_done+0x4a/0x90 [ 3963.138035] [<c14af863>] qh_completions+0x203/0x580 [ 3963.138164] [<c1075a26>] ? hrtimer_start_range_ns+0x26/0x30 [ 3963.138309] [<c14b0738>] end_unlink_async+0x108/0x1f0 [ 3963.138441] [<c14b3141>] ehci_irq+0xf1/0x430 [ 3963.138553] [<c107b96f>] ? __wake_up+0x3f/0x50 [ 3963.138671] [<c13ee480>] ? add_interrupt_randomness+0x120/0x170 [ 3963.138821] [<c13ec8b9>] ? __mix_pool_bytes+0x39/0x80 [ 3963.138951] [<c13ee4a9>] ? add_interrupt_randomness+0x149/0x170 [ 3963.139100] [<c13ee4a9>] ? add_interrupt_randomness+0x149/0x170 [ 3963.139251] [<c149cb23>] usb_hcd_irq+0x33/0x50 [ 3963.139368] [<c10a27d5>] handle_irq_event_percpu+0x35/0x1a0 [ 3963.139513] [<c103bf5d>] ? __unmask_ioapic+0x2d/0x40 [ 3963.139641] [<c10a2971>] handle_irq_event+0x31/0x50 [ 3963.139769] [<c10a52a0>] ? unmask_irq+0x30/0x30 [ 3963.139887] [<c10a52ee>] handle_fasteoi_irq+0x4e/0xe0 [ 3963.140012] <IRQ> [ 3963.140067] [ 3963.140119] [<c164ceac>] ? do_IRQ+0x3c/0xb0 [ 3963.140199] [<c164cc73>] ? common_interrupt+0x33/0x38 [ 3963.140334] [<c152375e>] ? cpuidle_enter_state+0x3e/0xd0 [ 3963.140471] [<c152388e>] ? cpuidle_idle_call+0x9e/0x1d0 [ 3963.140606] [<c10173dd>] ? arch_cpu_idle+0xd/0x30 [ 3963.140729] [<c10a1eab>] ? cpu_startup_entry+0x9b/0x200 [ 3963.140865] [<c1039a5b>] ? setup_APIC_timer+0xab/0x130 [ 3963.140997] [<c1038188>] ? start_secondary+0x208/0x2d0 [ 3963.147008] Code: eb d6 83 cb ff eb d1 8d b6 00 00 00 00 55 89 e5 53 3e 8d 74 26 00 b8 26 8b b5 c1 89 d3 e8 ab 95 1a 00 8b 4b 14 8d 43 14 8b 53 18 <89> 51 04 89 0a 89 43 14 89 43 18 b8 26 8b b5 c1 e8 1d 95 1a 00 [ 3963.160167] EIP: [<c149bc1e>] usb_hcd_unlink_urb_from_ep+0x1e/0x40 SS:ESP 0068:f7109e4c [ 3963.166670] CR2: 000000006b6b6b6f -- Devin J. Heitmueller - Kernel Labs http://www.kernellabs.com -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html