On Wed, 25 Feb 2015, Devin Heitmueller wrote: > Hi Alan, > > I think I see what's going on. Permit me to comment on your > explanation of urb->use_count first, since it's relevant later on. I won't go over this in great detail, because I think your proposed explanation is wrong. My impression is that we aren't looking in the right place. > 1. A task calls usb_kill_urb() against a URB with use count of 1 (or > sometimes 2). > 2. usb_kill_urb() increments urb->reject so any attempt to resubmit > the URB fails > 3. usb_kill_urb() calls usb_hcd_unlink_urb() > *** > 4. usb_kill_urb() spins on &usb_kill_urb_queue waiting for use_count > to reach zero. > 5. usb_kill_urb() returns to the caller, which typically then calls > usb_free_urb(urb). > > Between steps 3 and 4, the following things occur: > 3a. usb_hcd_unlink_urb() sees urb->use_count is nonzero and calls > unlink1() against the URB > 3b. unlink1() calls ehci_urb_dequeue(), which calls > usb_hcd_check_unlink_urb, which sets the urb->unlinked field to > -ENOENT > 3c. ehci_urb_dequeue() calls start_unlink_async() for the QH tied to the URB > > Sometime while usb_kill_urb() is spinning on urb->use_count, the IRQ > handler eventually fires: > 3d. ehci_irq() calls end_unlink_async() > 3e. end_unlink_async() calls qh_completions() > 3f. qh_completions() calls ehci_urb_done() > 3g. ehci_urb_done() calls __usb_hcd_giveback_urb > 3h. __usb_hcd_giveback_urb calls the completion handler for the URB > 3i. Because the URB isn't resubmitted, use_count drops to zero and > wake_up(&usb_kill_urb_queue) is called. > > === > > For the failure case, I see the completion handler firing for the URB, > but the URB isn't being resubmitted because of the flag in the > completion handler. Then when the kill comes around, the > urb->use_count is zero and thus it returns immediately rather than > blocking on the wait queue. But this doesn't change the fact that the > end_unlink_async() hasn't necessarily been invoked yet for the URB. > When the end_unlink_async() does finally do its housekeeping and call > ehci_urb_done(), it's being called against a URB that has already been > freed. This misses a crucial point: the completion handler is called (indirectly) by end_unlink_async(). Therefore end_unlink_async() _has_ necessarily already been invoked by the time usb_kill_urb() runs. > Below is a trace which shows the crash, and how during the completion > handler the use count is good for URB ebc54410, but because the URB > won't be resubmitted it drops to zero by the time the usb_kill_urb() > is called (note, URB ebc54410 is the second to last URB killed before > the crash). You can also see at the time the completion handler is > called for URB ebc54410 that while I've set the flag that the > completion handler looks for to return without resubmitting, I haven't > actually issued the usb_kill_urb() call, which is why the status is 0 > instead of -2. The log doesn't show submissions, so I'll take your word for it that the ebc54410 URB was never resubmitted. Then here's what we see: The URB is given back and the completion handler doesn't resubmit it. Sometime later we crash because usb_hcd_unlink_urb_from_ep() was called for that URB (which was freed in the meantime). > There's clearly a reference counting problem if the IRQ handler is > still referring to qh->urb entries in its list, but the underlying URB > is no longer in a submitted state. That said, I don't really know > enough about the internals of the EHCI controller model to know what > the right fix is. It may not be a reference counting problem at all. The scenario I outlined above should not be possible because after the URB is given back, ehci-hcd should not have any pointers to it in any data structures. This is something you may be able to check. First, a brief explanation of how things are supposed to work: qh_completions() primarily consists of a big loop: list_for_each_safe (entry, tmp, &qh->qtd_list) { struct ehci_qtd *qtd; struct urb *urb; u32 token = 0; qtd = list_entry (entry, struct ehci_qtd, qtd_list); urb = qtd->urb; ... This iterates through the qTDs on qh->qtd_list, and it uses the URB pointer stored in the qTD. The code calls ehci_urb_done() whenever it reaches a boundary between two URBs (note that each URB can have multiple qTDs). Thus, near the top of the loop it checks to see if the current URB is different from the one on the last iteration, and it does something similar just after the loop ends. When ehci_urb_done() is called, _all_ the qTDs for that URB are supposed to have been removed from qh->qtd_list. There should be nothing left that points to the URB being given back. The only way to get another qTD pointing to that URB is if the URB is resubmitted. Evidently this invariant isn't holding in your situation. You can test for it easily enough. Just before the two places where qh_completions() calls ehci_urb_done(), loop through all the qTDs on qh->qtd_list and see if any of them has qtd->urb equal to last->urb. If this happens, it's a bug. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html