Hi, On Sun, 2012-09-02 at 22:12 +0300, Dimitar Dimitrov wrote: > Very rare kernel crashes are reported on a custom OMAP4 board. Kernel > panics due to corrupted completion structure while executing > dispc_irq_wait_handler(). Excerpt from kernel log: > > Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP > Unable to handle kernel paging request at virtual address 00400130 > ... > PC is at 0xebf205bc > LR is at __wake_up_common+0x54/0x94 > ... > (__wake_up_common+0x0/0x94) > (complete+0x0/0x60) > (dispc_irq_wait_handler.36902+0x0/0x14) > (omap_dispc_irq_handler+0x0/0x354) > (handle_irq_event_percpu+0x0/0x188) > (handle_irq_event+0x0/0x64) > (handle_fasteoi_irq+0x0/0x10c) > (generic_handle_irq+0x0/0x48) > (asm_do_IRQ+0x0/0xc0) > > DISPC IRQ executes callbacks with dispc.irq_lock released. Hence > unregister_isr() and DISPC IRQ might be running in parallel on different > CPUs. So there is a chance that a callback is executed even though it > has been unregistered. As omap_dispc_wait_for_irq_timeout() declares a > completion on stack, the dispc_irq_wait_handler() callback might try to > access a completion structure that is invalid. This leads to crashes and > hangs. > > Solution is to divide unregister calls into two sets: > 1. Non-strict unregistering of callbacks. Callbacks could safely be > executed after unregistering them. This is the case with unregister > calls from the IRQ handler itself. > 2. Strict (synchronized) unregistering. Callbacks are not allowed > after unregistering. This is the case with completion waiting. > > The above solution should satisfy one of the original intentions of the > driver: callbacks should be able to unregister themselves. I think it'd be better to create a new function for the nosync version, and keep the old name for the sync version. The reason for this is to minimize the amount of changes, as I think this one needs to be applied to stable kernel trees also. Also, I think we need similar one for dsi.c, as it has the same kind of irq handling. But with a quick glance only sync version is needed there. However, I'm not quite sure about this approach. The fix makes sense, but it makes me think if the irq handling is designed the wrong way. While debugging and fixing this, did you think some other irq handling approach would be saner? Tomi
Attachment:
signature.asc
Description: This is a digitally signed message part