Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

Heiko Przybyl <lil_tux@xxxxxx> · Sun, 10 May 2015 22:21:24 +0200

Hello again.

The issue still bites me with 4.0.2.

Though I was able to narrow down the issue, it's still hard to 
deterministically trigger it.

Anyway. The setup:
- OHCI controller with hooked up keyboard
- xhddled (program that toggles scrolllock on disk activity)
- disk activity

I'm very certain that it happens, when usbhid issues an urb for scrolllock led 
toggling, the ohci driver adding a duplicate to its eds_in_use list. That's 
why I hit it way more often with disk activity. I also tried to alter the 
program to just periodically toggle the scrolllock, but that didn't seem to 
help for reproducibility.

On Friday 23 January 2015 15:21:11 Alan Stern wrote:
> On Tue, 20 Jan 2015, Heiko Przybyl wrote:
> > I'm not 100% sure, but then it's probably a race between urb
> > enqueuing (duplicates?) and the watchdog orphan cleanup.
> > 
> > The crash log already shows the double list add in ohci_urb_enqueue
> > "
> > ohci-hcd.c:238: list_add(&ed->in_use_list, &ohci->eds_in_use);
> > "
> > This is probably due to the ed returned by ed_get() being reused before
> > the
> > watchdog ran, thus the same in_use_list re-added to ohci.eds_in_use.
> 
> But this won't happen unless ed->state was ED_IDLE; and then
> ed_schedule() changes ed->state to ED_OPER before ed is added to the
> in_use_list.

As suggested, checking the eds_in_use list was a good idea. Now, I'm even more 
convinced duplicates are being added to the eds_in_use list. 

Check right before the list_add in ohci_urb_enqueue():

[73942.398966] 12.0: oc(pre add): elem ffff88009faff040 not expected but FOUND
[73942.407393] 12.0:  e ffff88009faff040 e.prev ffff8800c9c845b8 e.next 
ffff88009faff120
[73942.416710] 12.0:  e ffff88009faff120 e.prev ffff88009faff040 e.next 
ffff88009faff0b0
[73942.426044] 12.0:  e ffff88009faff0b0 e.prev ffff88009faff120 e.next 
ffff8800c9c845b8
[73942.436752] 12.0: oc(pre add): EOL

[73942.675303] Call Trace:
[73942.677779]  <IRQ>
[73942.679755]  [<ffffffffa001bad0>] ohci_check+0x250/0x250 [ohci_hcd]
[73942.686313]  [<ffffffffa00215ac>] ohci_urb_enqueue+0x84c/0xb9a [ohci_hcd]
[73942.693127]  [<ffffffff810ac079>] ? select_task_rq_fair+0x469/0x7f0
[73942.699489]  [<ffffffff813ba4f9>] usb_hcd_submit_urb+0xa1/0xae0
[73942.705427]  [<ffffffff810afb5d>] ? update_group_capacity+0x24d/0x278
[73942.711929]  [<ffffffff8128e08f>] ? cpumask_next_and+0x2f/0x40
[73942.717791]  [<ffffffff813bc246>] usb_submit_urb+0x42e/0x5e8
[73942.723491]  [<ffffffffa01ef6ab>] hid_submit_ctrl+0x12b/0x290 [usbhid]
[73942.730042]  [<ffffffffa01f07c3>] hid_ctrl+0xeb/0x170 [usbhid]
[73942.735929]  [<ffffffff813b897c>] __usb_hcd_giveback_urb+0x64/0xe8
[73942.742165]  [<ffffffff813b8b0e>] usb_hcd_giveback_urb+0x3e/0x130
[73942.748267]  [<ffffffffa001b734>] finish_urb+0x74/0x158 [ohci_hcd]
[73942.754459]  [<ffffffffa001d1e4>] ohci_work.part.4+0x26c/0x5f8 [ohci_hcd]
[73942.761272]  [<ffffffffa0020c60>] ohci_irq+0x218/0x318 [ohci_hcd]
[73942.767394]  [<ffffffff813b8399>] usb_hcd_irq+0x21/0x38
[73942.772653]  [<ffffffff810c7048>] handle_irq_event_percpu+0x60/0x108
[73942.779029]  [<ffffffff810c712c>] handle_irq_event+0x3c/0x60
[73942.784718]  [<ffffffff810c9d21>] handle_fasteoi_irq+0x81/0x130
[73942.790675]  [<ffffffff81004acd>] handle_irq+0x1d/0x40
[73942.795844]  [<ffffffff810044aa>] do_IRQ+0x4a/0xe0
[73942.800669]  [<ffffffff814c28aa>] common_interrupt+0x6a/0x6a

When checking after the list_add, there's a loop in the linked list, making it 
endless and probably locking the machine.

12.0:  e ffff8800c9993120 e.prev ffff8800cbb065b8 e.next ffff8800c9993120
12.0: list corruption: p(ffff8800c9993120) != e.prev(ffff8800cbb065b8)

[21042.467670] Call Trace:
[21042.470162]  <IRQ>
[21042.472168]  [<ffffffffa0056646>] ohci_urb_enqueue+0x876/0xb9a [ohci_hcd]
[21042.479327]  [<ffffffff810b63ff>] ? __wake_up_sync_key+0x4f/0x60
[21042.485474]  [<ffffffff810ac079>] ? select_task_rq_fair+0x469/0x7f0
[21042.491846]  [<ffffffff813ba4f9>] usb_hcd_submit_urb+0xa1/0xae0
[21042.497873]  [<ffffffff810b5f73>] ? __wake_up+0x43/0x58
[21042.503231]  [<ffffffff813712c8>] ? rpm_idle+0x18/0x148
[21042.508530]  [<ffffffff813bc246>] usb_submit_urb+0x42e/0x5e8
[21042.514302]  [<ffffffffa00e46ab>] hid_submit_ctrl+0x12b/0x290 [usbhid]
[21042.520964]  [<ffffffff810b5f73>] ? __wake_up+0x43/0x58
[21042.526300]  [<ffffffffa00e57c3>] hid_ctrl+0xeb/0x170 [usbhid]
[21042.532202]  [<ffffffff813b897c>] __usb_hcd_giveback_urb+0x64/0xe8
[21042.538497]  [<ffffffff813b8b0e>] usb_hcd_giveback_urb+0x3e/0x130
[21042.544689]  [<ffffffffa0050734>] finish_urb+0x74/0x158 [ohci_hcd]
[21042.551005]  [<ffffffffa0052254>] ohci_work.part.4+0x26c/0x5f8 [ohci_hcd]
[21042.557940]  [<ffffffffa0055cd0>] ohci_irq+0x218/0x318 [ohci_hcd]
[21042.564152]  [<ffffffff813b8399>] usb_hcd_irq+0x21/0x38
[21042.569485]  [<ffffffff810c7048>] handle_irq_event_percpu+0x60/0x108
[21042.575968]  [<ffffffff810c712c>] handle_irq_event+0x3c/0x60
[21042.581727]  [<ffffffff810c9d21>] handle_fasteoi_irq+0x81/0x130
[21042.587781]  [<ffffffff81004acd>] handle_irq+0x1d/0x40
[21042.593027]  [<ffffffff810044aa>] do_IRQ+0x4a/0xe0
[21042.597909]  [<ffffffff814c28aa>] common_interrupt+0x6a/0x6a

> 
> And finish_unlinks() is where ed->state gets set back to ED_IDLE.
> Furthermore, all these operations take place under the protection of
> ohci->lock, so they can't race with each other.
> 
> They have to happen in strict sequence: list_add, list_del, list_add.
> The second list_add can't occur until after the list_del, because
> ed->state won't be equal to ED_IDLE until then.
> 
> At least, that's how it's _supposed_ to work...

Well, finish_urb() still does unlock the ohci->lock before calling 
usb_hcd_giveback_urb(), because the completion routine can reenter. It then 
locks again after that call. Unfortunately, usbhid's completion routine 
hid_ctrl() does infact issue another urb on the same ed. So I'd say there's 
a small time slot where things can go awkward, if the newly submitted urb 
arrives before the ohci lock gets locked again in finish_urb.

What I'm driving at is something in the way of:

finish_urb()
_unlock()
_usb_hcd_giveback_urb()
__hid_ctrl()
___urb_enqueue()
____lock()
____list_add()
____unlock()
_lock()
list_del()

Would it be feasible to run usb_hcd_giveback_urb() after the cleanup parts?

> 
> It's a puzzling problem.  I hope you can find the cause.
Well, I'm puzzled and I hope I'm pointing at the right location :-)

> 
> Alan Stern

Kind regards,

   Heiko

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html