Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Tue, 20 Jan 2015 10:49:29 -0500 (EST)

On Mon, 19 Jan 2015, Heiko Przybyl wrote:

> On Monday 19 January 2015 11:17:59 Alan Stern wrote:
> > On Mon, 19 Jan 2015, Heiko Przybyl wrote:
> > > It seems to be related to keyboard input (at least it happens when using
> > > the keyboard), without relation to system load. Can happen within a day
> > > after boot or after several days of hibernated uptime. Unfortunately, I
> > > haven't found a way to reliably reproduce the issue, yet.
> > > 
> > > [..]
> > > 
> > > My (pretty wild) guess is that the corruption happens through a race in
> > > the interrupt handler ohci_irq(), which calls ohci_work(), which calls
> > > finish_urb(), which states:
> > > " * PRECONDITION:  ohci lock held, irqs blocked"
> > > 
> > > But ohci_irq() seems to only spin_[un]lock(), not spin_[un]lock_irq[save|
> > > restore](). All other functions that call ohci_work() do at least
> > > spin_[un]lock_irq. So irqs could still be enabled and possibly the event
> > > triggered twice, thus the double list add?
> > 
> > That's easy enough to test.  All you have to do is change the
> > spin_lock/unlock statements to their irq_save/restore variants.
> 
> Well, thought about that as well, but I'm not sure when to take it as fixed and 
> when to take it as issue-just-didn't-happen-yet, because of the not-so-
> deterministic occurrence of the error. But I can try it out anyway, just 
> wanted to have some feedback before trying.

By the way, failing to disable interrupts when acquiring a spinlock 
generally does not lead to data corruption -- it leads to deadlocks.  
So I doubt this is the cause of your problem.  If you really want to, 
you could add a

	WARN_ON(!irqs_disabled());

line to ohci_irq().

> > If that's not the explanation then we'll have to dig deeper.
> 
> I can still work on a saved vmcore dump of a crash. Btw., using crash(1) and
> its command `bt -E` shows two kernel-mode exceptions. Though, I can't figure out
> where the first one originates from.
> 
> CPU 3 IRQ STACK:
>   KERNEL-MODE EXCEPTION FRAME AT: ffff88022ecc3638
>     [exception RIP: _raw_spin_unlock_irqrestore+9]
>     RIP: ffffffff814774b9  RSP: ffff88022ecc36e8  RFLAGS: 00000202
>     RAX: ffff88022ecc36a8  RBX: ffff88022ecc36b0  RCX: ffffffff81290279
>     RDX: 0000000000002dff  RSI: 0000000000000000  RDI: ffff88022ecc3788
>     RBP: ffff88022ecc36e8   R8: 0000000000000080   R9: 0000000000000023
>     R10: ffffffff813e6407  R11: ffffea000863ad80  R12: ffff88022ecc3658
>     R13: ffffffff81478b2a  R14: ffff88022ecc36e8  R15: 0000000000000001
>     ORIG_RAX: ffffffff81471cfd  CS: 0010  SS: 0018
> 
>     0xffffffff814774b9 <+9>:     decl   %gs:0xa860

No idea.

It might be a good idea for you to try something a little more 
invasive.  How about writing a routine to check the entire 
ohci->eds_in_use list for validity (each forward pointer is matched by 
the corresponding backward pointer), and calling this routine at each 
place where the list gets modified, before the modification happens?

You could also make sure that an entry being added to the list isn't on 
the list already, and whenever an entry is deleted from the list 
either it really is on the list or else its list pointers point to 
themselves.
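
A minimal userspace sketch of the checks described above, for
illustration only.  It assumes a circular doubly linked list laid out
like the kernel's struct list_head; the names list_node, MAX_WALK,
list_is_valid(), checked_add_tail() and checked_del_init() are
hypothetical and are not part of ohci-hcd.  In the driver the same
logic would run on ohci->eds_in_use and report with WARN_ON() rather
than return a bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (circular, doubly linked). */
struct list_node {
	struct list_node *next, *prev;
};

/* Cap on nodes walked so a corrupted list cannot loop forever (assumed value). */
#define MAX_WALK 4096

void node_init(struct list_node *n)
{
	n->next = n;
	n->prev = n;
}

/* True if every forward pointer is matched by the corresponding backward pointer. */
bool list_is_valid(const struct list_node *head)
{
	const struct list_node *n = head;
	size_t steps = 0;

	do {
		if (!n->next || !n->prev)
			return false;
		if (n->next->prev != n || n->prev->next != n)
			return false;
		n = n->next;
	} while (n != head && ++steps < MAX_WALK);

	return n == head;	/* false if the walk was cut off by MAX_WALK */
}

/* True if entry is currently linked on the list headed by head. */
bool list_contains(const struct list_node *head, const struct list_node *entry)
{
	const struct list_node *n;
	size_t steps = 0;

	for (n = head->next; n != head && steps < MAX_WALK; n = n->next, steps++)
		if (n == entry)
			return true;
	return false;
}

/* True if the node's list pointers point to itself (i.e. it is unlinked). */
bool node_is_self_linked(const struct list_node *n)
{
	return n->next == n && n->prev == n;
}

/* Add at the tail; refuse if the list is invalid or the entry is already on it. */
bool checked_add_tail(struct list_node *entry, struct list_node *head)
{
	if (!list_is_valid(head) || list_contains(head, entry))
		return false;
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
	return true;
}

/* Delete; accept an entry that is on the list, or one that is self-linked. */
bool checked_del_init(struct list_node *entry, struct list_node *head)
{
	if (!list_is_valid(head))
		return false;
	if (!list_contains(head, entry))
		return node_is_self_linked(entry);
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	node_init(entry);
	return true;
}
```

Both checked helpers validate the whole list before touching it, so a
failure points at whichever earlier modification left the list
inconsistent, not at the one that happens to trip over it.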

Alan Stern
