Re: URB IRQ fires on URB after usb_kill_urb() already completed

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Alan,

I think I see what's going on.  Permit me to comment on your
explanation of urb->use_count first, since it's relevant later on.

> Here's the story:
>
>         A new URB has its use_count set to 0 by usb_init_urb.
>
>         The use_count is incremented when the URB is submitted,
>         in usb_hcd_submit_urb.  (If the submission fails then
>         the use_count is decremented again.)
>
>         The use_count is decremented after the completion routine
>         returns, in usb_hcd_giveback_urb.
>
> This means that an active URB should never have its use_count equal to
> 0.  The value is normally 1; it can go up to 2 temporarily if the
> completion routine resubmits the URB.

There's an additional state here that you're not mentioning - a URB
which has been submitted, serviced, but then *not* rescheduled in the
completion handler.  In this case the use_count will drop back down to
zero.

Now in many drivers the completion handler doesn't resubmit the URB
because of some error condition (such as the urb->status being ENOENT
because some thread of execution called usb_kill_urb()), or in other
drivers the completion handler will blindly call usb_submit_urb()
regardless of urb->status but the call will fail (again for example
because somebody called usb_kill_urb() and thus urb->reject was set to
1).  However, there's a third condition which can cause the completion
handler to not resubmit the URB, and that's because the driver has a
flag which simply says not to (the au0828 driver does this to stop
streaming).  Hence you can have a case where the URB completion
handler was called with (urb->status == 0) but the URB will not still
be resubmitted, and thus when usb_kill_urb() is eventually called
against the URB, the use_count is going to be zero when entering the
function.

Hence the actual lifecycle is:
create: 0
submit: 1
complete:
 - if completion handler calls usb_submit_urb() then 2 during
usb_submit_urb(), then back to 1 after completion returns.
 - if completion handler doesn't call usb_submit_urb(), then 0 after
completion returns.

Why does this matter?  Asynchronous unlinking.  But let's first look
at how usb_kill_urb() works 99.99% of the time, which is against a URB
which has been submitted and is actively being serviced:

1.  A task calls usb_kill_urb() against a URB with use count of 1 (or
sometimes 2).
2.  usb_kill_urb() increments urb->reject so any attempt to resubmit
the URB fails
3.  usb_kill_urb() calls usb_hcd_unlink_urb()
***
4.  usb_kill_urb() spins on &usb_kill_urb_queue waiting for use_count
to reach zero.
5.  usb_kill_urb() returns to the caller, which typically then calls
usb_free_urb(urb).

Between steps 3 and 4, the following things occur:
3a.  usb_hcd_unlink_urb() sees urb->use_count is nonzero and calls
unlink1() against the URB
3b.  unlink1() calls ehci_urb_dequeue(), which calls
usb_hcd_check_unlink_urb, which sets the urb->unlinked field to
-ENOENT
3c.  ehci_urb_dequeue() calls start_unlink_async() for the QH tied to the URB

Sometime while usb_kill_urb() is spinning on urb->use_count, the IRQ
handler eventually fires:
3d.  ehci_irq() calls end_unlink_async()
3e.  end_unlink_async() calls qh_completions()
3f.  qh_completions() calls ehci_urb_done()
3g.  ehci_urb_done() calls __usb_hcd_giveback_urb
3h.   __usb_hcd_giveback_urb calls the completion handler for the URB
3i. Because the URB isn't resubmitted, use_count drops to zero and
wake_up(&usb_kill_urb_queue) is called.

===

For the failure case, I see the completion handler firing for the URB,
but the URB isn't being resubmitted because of the flag in the
completion handler.  Then when the kill comes around, the
urb->use_count is zero and thus it returns immediately rather than
blocking on the wait queue.  But this doesn't change the fact that the
end_unlink_async() hasn't necessarily been invoked yet for the URB.
When the end_unlink_async() does finally do its housekeeping and call
ehci_urb_done(), it's being called against a URB that has already been
freed.

The problem is exacerbated by the fact that I immediately call
usb_free_urb() right after the usb_kill_urb() call for a given URB,
and hence create a larger window of opportunity to hit the
use-after-free condition.  Even a short sleep would likely allow for
enough time after the usb_kill_urb() call to allow end_unlink_async()
to finish its work before the underlying URB was freed.

Below is a trace which shows the crash, and how during the completion
handler the use count is good for URB ebc54410, but because the URB
won't be resubmitted it drops to zero by the time the usb_kill_urb()
is called (note, URB ebc54410 is the second to last URB killed before
the crash).  You can also see at the time the completion handler is
called for URB ebc54410 that while I've set the flag that the
completion handler looks for to return without resubmitting, I haven't
actually issued the usb_kill_urb() call, which is why the status is 0
instead of -2.

One other data point worth mentioning:  The current au0828 driver code
does the following to shutdown the stream:

for (i = 0; i < URB_COUNT; i++) {
  usb_kill_urb(dev->urbs[i])
  kfree(dev->urbs[i]->transfer_buffer)
  usb_free(dev->urbs[i])
}

If I change the code as follows, I completely avoid the race condition:

for (i = 0; i < URB_COUNT; i++) {
  usb_kill_urb(dev->urbs[i])
}
for (i = 0; i < URB_COUNT; i++) {
  kfree(dev->urbs[i]->transfer_buffer)
  usb_free(dev->urbs[i])
}

By killing all the URBs before freeing any of them, I avoid the
possibility of end_unlink_async() being invoked against a free'd URB.
This is obviously a lousy hack, but it was good enough to roll out
since I've been unable to reproduce the issue after four hours of
continuous testing (whereas before the patch I could cause a crash in
a manner of minutes).

There's clearly a reference counting problem if the IRQ handler is
still referring to qh->urb entries in its list, but the underlying URB
is no longer in a submitted state.  That said, I don't really know
enough about the internals of the EHCI controller model to know what
the right fix is.

[ 3958.875980] au0828/0: start_urb_transfer()
[ 3958.877332] au0828/0: start_urb_transfer()done ret=0
[ 3963.088348] au0828/0: au0828_dvb_stop_feed()
[ 3963.088366] au0828/0: au0828_dvb_stop_feed(), start_count: 2646,
stop_count: 2646
[ 3963.088373] au0828/0: stop_urb_transfer()
[ 3963.088382] au0828/0: killing urb ebc540d0 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.090978] au0828/0: urb_completion() urb=ebc540d0 use_count=1 status=-2
[ 3963.091030] au0828/0: urb_completion() urb=ebc54410 use_count=1 status=0
[ 3963.091113] au0828/0: killing urb ebc54dd0 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.093815] au0828/0: urb_completion() urb=ebc54dd0 use_count=1 status=-2
[ 3963.093880] au0828/0: killing urb ebc548f0 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.096985] au0828/0: urb_completion() urb=ebc548f0 use_count=1 status=-2
[ 3963.097103] au0828/0: killing urb ebc54680 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.099803] au0828/0: urb_completion() urb=ebc54680 use_count=1 status=-2
[ 3963.099849] au0828/0: killing urb ebc54750 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.102817] au0828/0: urb_completion() urb=ebc54750 use_count=1 status=-2
[ 3963.102918] au0828/0: killing urb ebc54000 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.105803] au0828/0: urb_completion() urb=ebc54000 use_count=1 status=-2
[ 3963.105851] au0828/0: killing urb ebc544e0 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.108824] au0828/0: urb_completion() urb=ebc544e0 use_count=1 status=-2
[ 3963.108932] au0828/0: killing urb ebc54c30 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.111806] au0828/0: urb_completion() urb=ebc54c30 use_count=1 status=-2
[ 3963.111853] au0828/0: killing urb ebc54270 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.114821] au0828/0: urb_completion() urb=ebc54270 use_count=1 status=-2
[ 3963.114921] au0828/0: killing urb ebc54820 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.117806] au0828/0: urb_completion() urb=ebc54820 use_count=1 status=-2
[ 3963.117854] au0828/0: killing urb ebc54a90 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.120828] au0828/0: urb_completion() urb=ebc54a90 use_count=1 status=-2
[ 3963.120931] au0828/0: killing urb ebc545b0 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.123809] au0828/0: urb_completion() urb=ebc545b0 use_count=1 status=-2
[ 3963.123855] au0828/0: killing urb ebc54340 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.126824] au0828/0: urb_completion() urb=ebc54340 use_count=1 status=-2
[ 3963.126923] au0828/0: killing urb ebc54b60 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.129810] au0828/0: urb_completion() urb=ebc54b60 use_count=1 status=-2
[ 3963.129857] au0828/0: killing urb ebc54410 dev=f69ab4d0 ep=f73ee500 uc=0
[ 3963.129871] au0828/0: killing urb ebc549c0 dev=f69ab4d0 ep=f73ee500 uc=1
[ 3963.132860] BUG: unable to handle kernel paging request at 6b6b6b6f
[ 3963.133060] IP: [<c149bc1e>] usb_hcd_unlink_urb_from_ep+0x1e/0x40
[ 3963.133230] *pdpt = 000000002e8c2001 *pde = 0000000000000000
[ 3963.133384] Oops: 0002 [#1] SMP
[ 3963.133480] Modules linked in: cuse nfsd(F) auth_rpcgss(F)
nfs_acl(F) nfs(F) lockd(F) sunrpc(F) fscache(F) snd_usb_audio
snd_usbmidi_lib au8522_dig(OF) snd_hda_codec_realtek
snd_hda_codec_hdmi btusb joydev(F) xc5000(OF) tuner(OF) bluetooth
au8522_decoder(OF) au8522_common(OF) hid_generic kvm_amd(F) kvm(F)
snd_hda_intel microcode(F) snd_hda_codec snd_hwdep(F) radeon
snd_pcm(F) arc4(F) iwldvm snd_page_alloc(F) mac80211 k10temp
snd_seq_midi(F) snd_seq_midi_event(F) au0828(OF) snd_rawmidi(F)
tveeprom(OF) videobuf_vmalloc(OF) videobuf_core(OF) v4l2_common(OF)
ttm iwlwifi dvb_core(OF) usbhid videodev(OF) media(OF) hid snd_seq(F)
drm_kms_helper snd_seq_device(F) snd_timer(F) drm sp5100_tco ohci_pci
snd(F) i2c_piix4 cfg80211 i2c_algo_bit soundcore(F) mac_hid lp(F)
parport(F) r8169 mii(F) ahci(F) libahci(F)
[ 3963.135611]
[ 3963.135629] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GF          O
3.12.0-031200rc7-generic #201310271935
[ 3963.135851] Hardware name: To be filled by O.E.M. To be filled by
O.E.M./Inagua CRB, BIOS 4.6.5 02/12/2014
[ 3963.136081] task: f70ce800 ti: f7108000 task.ti: f70c6000
[ 3963.136215] EIP: 0060:[<c149bc1e>] EFLAGS: 00210046 CPU: 1
[ 3963.136352] EIP is at usb_hcd_unlink_urb_from_ep+0x1e/0x40
[ 3963.136487] EAX: ebc54424 EBX: ebc54410 ECX: 6b6b6b6b EDX: 6b6b6b6b
[ 3963.136638] ESI: ebc54410 EDI: f6849860 EBP: f7109e50 ESP: f7109e4c
[ 3963.136789]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 3963.136921] CR0: 8005003b CR2: 6b6b6b6f CR3: 2e821000 CR4: 000007f0
[ 3963.137071] Stack:
[ 3963.137124]  f6849a64 f7109e68 c14ae94a ffffff8d f6872ea0 ebcb3edc
ebc26930 f7109ecc
[ 3963.137364]  c14af863 36872d80 00000000 00200082 000f4240 00000007
f68499e0 f7109ea0
[ 3963.137601]  c1075a26 000f4240 03010000 ebcb3e40 f6872ea0 ebc26940
00000001 f68499a8
[ 3963.137839] Call Trace:
[ 3963.137913]  [<c14ae94a>] ehci_urb_done+0x4a/0x90
[ 3963.138035]  [<c14af863>] qh_completions+0x203/0x580
[ 3963.138164]  [<c1075a26>] ? hrtimer_start_range_ns+0x26/0x30
[ 3963.138309]  [<c14b0738>] end_unlink_async+0x108/0x1f0
[ 3963.138441]  [<c14b3141>] ehci_irq+0xf1/0x430
[ 3963.138553]  [<c107b96f>] ? __wake_up+0x3f/0x50
[ 3963.138671]  [<c13ee480>] ? add_interrupt_randomness+0x120/0x170
[ 3963.138821]  [<c13ec8b9>] ? __mix_pool_bytes+0x39/0x80
[ 3963.138951]  [<c13ee4a9>] ? add_interrupt_randomness+0x149/0x170
[ 3963.139100]  [<c13ee4a9>] ? add_interrupt_randomness+0x149/0x170
[ 3963.139251]  [<c149cb23>] usb_hcd_irq+0x33/0x50
[ 3963.139368]  [<c10a27d5>] handle_irq_event_percpu+0x35/0x1a0
[ 3963.139513]  [<c103bf5d>] ? __unmask_ioapic+0x2d/0x40
[ 3963.139641]  [<c10a2971>] handle_irq_event+0x31/0x50
[ 3963.139769]  [<c10a52a0>] ? unmask_irq+0x30/0x30
[ 3963.139887]  [<c10a52ee>] handle_fasteoi_irq+0x4e/0xe0
[ 3963.140012]  <IRQ>
[ 3963.140067]
[ 3963.140119]  [<c164ceac>] ? do_IRQ+0x3c/0xb0
[ 3963.140199]  [<c164cc73>] ? common_interrupt+0x33/0x38
[ 3963.140334]  [<c152375e>] ? cpuidle_enter_state+0x3e/0xd0
[ 3963.140471]  [<c152388e>] ? cpuidle_idle_call+0x9e/0x1d0
[ 3963.140606]  [<c10173dd>] ? arch_cpu_idle+0xd/0x30
[ 3963.140729]  [<c10a1eab>] ? cpu_startup_entry+0x9b/0x200
[ 3963.140865]  [<c1039a5b>] ? setup_APIC_timer+0xab/0x130
[ 3963.140997]  [<c1038188>] ? start_secondary+0x208/0x2d0
[ 3963.147008] Code: eb d6 83 cb ff eb d1 8d b6 00 00 00 00 55 89 e5
53 3e 8d 74 26 00 b8 26 8b b5 c1 89 d3 e8 ab 95 1a 00 8b 4b 14 8d 43
14 8b 53 18 <89> 51 04 89 0a 89 43 14 89 43 18 b8 26 8b b5 c1 e8 1d 95
1a 00
[ 3963.160167] EIP: [<c149bc1e>] usb_hcd_unlink_urb_from_ep+0x1e/0x40
SS:ESP 0068:f7109e4c
[ 3963.166670] CR2: 000000006b6b6b6f


-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux