On Mon, 19 Jan 2015, Heiko Przybyl wrote:

> On Monday 19 January 2015 11:17:59 Alan Stern wrote:
> > On Mon, 19 Jan 2015, Heiko Przybyl wrote:
> > > It seems to be related to keyboard input (at least it happens when using
> > > the keyboard), without relation to system load. Can happen within a day
> > > after boot or after several days of hibernated uptime. Unfortunately, I
> > > haven't found a way to reliably reproduce the issue, yet.
> > >
> > > [..]
> > >
> > > My (pretty wild) guess is, that the corruption happens through a race in
> > > the interrupt handler ohci_irq(), which calls ohci_work(), which calls
> > > finish_urb(), which states:
> > > " * PRECONDITION: ohci lock held, irqs blocked"
> > >
> > > But ohci_irq() seems to only spin_[un]lock(), not spin_[un]lock_irq[save|
> > > restore](). All other functions that call ohci_work() do at least
> > > spin_[un]lock_irq. So irqs could still be enabled and possibly the event
> > > triggered twice, thus the double list add?
> >
> > That's easy enough to test.  All you have to do is change the
> > spin_lock/unlock statements to their irq_save/restore variants.
>
> Well, thought about that as well, but I'm not sure when to take it as fixed and
> when to take it as issue-just-didn't-happen-yet, because of the not-so-
> deterministic occurrence of the error. But I can try it out anyway, just
> wanted to have some feedback before trying.

By the way, failing to disable interrupts when acquiring a spinlock
generally does not lead to data corruption -- it leads to deadlocks.
So I doubt this is the cause of your problem.  If you really want to,
you could add a

	WARN_ON(!irqs_disabled());

line to ohci_irq().

> > If that's not the explanation then we'll have to dig deeper.
>
> I can still work on a saved vmcore dump of a crash. Btw. using crash(1) and
> its command `bt -E`shows two kernel mode exceptions. Though, I can't figure out
> where the first one originates from
>
> CPU 3 IRQ STACK:
> KERNEL-MODE EXCEPTION FRAME AT: ffff88022ecc3638
>     [exception RIP: _raw_spin_unlock_irqrestore+9]
>     RIP: ffffffff814774b9  RSP: ffff88022ecc36e8  RFLAGS: 00000202
>     RAX: ffff88022ecc36a8  RBX: ffff88022ecc36b0  RCX: ffffffff81290279
>     RDX: 0000000000002dff  RSI: 0000000000000000  RDI: ffff88022ecc3788
>     RBP: ffff88022ecc36e8   R8: 0000000000000080   R9: 0000000000000023
>     R10: ffffffff813e6407  R11: ffffea000863ad80  R12: ffff88022ecc3658
>     R13: ffffffff81478b2a  R14: ffff88022ecc36e8  R15: 0000000000000001
>     ORIG_RAX: ffffffff81471cfd  CS: 0010  SS: 0018
>
> 0xffffffff814774b9 <+9>:     decl   %gs:0xa860

No idea.

It might be a good idea for you to try something a little more
invasive.  How about writing a routine to check the entire
ohci->eds_in_use list for validity (each forward pointer is matched by
the corresponding backward pointer), and calling this routine at each
place where the list gets modified, before the modification happens?

You could also make sure that an entry being added to the list isn't on
the list already, and whenever an entry is deleted from the list either
it really is on the list or else its list pointers point to themselves.

Alan Stern
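
P.S.  The list checks described above might look roughly like the
following.  This is only a userspace sketch, not driver code: struct
list_head here is a stand-in for the kernel type, and list_is_valid(),
list_contains(), checked_add(), checked_del() and the max_nodes bound are
all hypothetical names made up for illustration.  In the driver the checks
would run on ohci->eds_in_use with the ohci lock held, and would presumably
WARN rather than return false.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Unchecked insert of 'entry' right after 'head'. */
static void list_add_raw(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unchecked removal; leaves the entry pointing to itself. */
static void list_del_raw(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	list_init(entry);
}

/*
 * Every forward pointer must be matched by the corresponding backward
 * pointer.  max_nodes bounds the walk so that a corrupted list which
 * never returns to 'head' still terminates.
 */
static bool list_is_valid(const struct list_head *head, size_t max_nodes)
{
	const struct list_head *pos = head;
	size_t n = 0;

	do {
		if (!pos->next || pos->next->prev != pos)
			return false;
		pos = pos->next;
	} while (pos != head && ++n <= max_nodes);

	return pos == head;
}

/* Is 'entry' currently on the list?  Call list_is_valid() first. */
static bool list_contains(const struct list_head *head,
			  const struct list_head *entry, size_t max_nodes)
{
	const struct list_head *pos;
	size_t n = 0;

	for (pos = head->next; pos != head && n < max_nodes;
	     pos = pos->next, n++)
		if (pos == entry)
			return true;
	return false;
}

/* Validate before modifying; refuse to add an entry twice. */
static bool checked_add(struct list_head *entry, struct list_head *head,
			size_t max_nodes)
{
	if (!list_is_valid(head, max_nodes) ||
	    list_contains(head, entry, max_nodes))
		return false;
	list_add_raw(entry, head);
	return true;
}

/* Entry must really be on the list, or else point to itself. */
static bool checked_del(struct list_head *entry, struct list_head *head,
			size_t max_nodes)
{
	bool self = entry->next == entry && entry->prev == entry;

	if (!list_is_valid(head, max_nodes))
		return false;
	if (!self && !list_contains(head, entry, max_nodes))
		return false;
	list_del_raw(entry);
	return true;
}
```

The full walk on every modification is slow, but for a debugging build
that should not matter, and it catches the corruption at the point where
it is first visible rather than at the later crash.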