On Tue, Jul 05, 2016 at 07:59:59PM -0500, Corey Minyard wrote:
> I'm having our hardware people keep the system as-is until we can
> track this down.
>
> A applied the above four patches and a few more support patches got that
> were needed, but no love. Exact same issue. Well, almost the same, here's
> the traceback:
>
> [ 0.455575] [<ffffffff810733c4>] try_to_wake_up+0x34/0x300
> [ 0.455590] [<ffffffff81067d76>] ? __hrtimer_start_range_ns+0x226/0x3a0
> [ 0.455593] [<ffffffff810736e0>] wake_up_process+0x10/0x20
> [ 0.455615] [<ffffffff8101c7a8>] mce_notify_irq+0x28/0x30
> [ 0.455621] [<ffffffff8101cbd9>] mce_irq_work_cb+0x9/0x10
> [ 0.455646] [<ffffffff810cbb0c>] irq_work_run_list+0x3c/0x60
> [ 0.455649] [<ffffffff810cbe97>] irq_work_tick_soft+0x27/0x30
> [ 0.455673] [<ffffffff8104dbe4>] run_timer_softirq+0x24/0x250
> [ 0.455681] [<ffffffff81045bce>] do_current_softirqs+0x1ae/0x250
> [ 0.455684] [<ffffffff81045c9e>] run_ksoftirqd+0x2e/0x50
> [ 0.455697] [<ffffffff8106c7f6>] smpboot_thread_fn+0x206/0x320
> [ 0.455700] [<ffffffff8106c5f0>] ? lg_global_unlock+0x60/0x60
> [ 0.455720] [<ffffffff81063cad>] kthread+0xad/0xc0
> [ 0.455740] [<ffffffff81730303>] ? _dbgp_external_startup+0x236/0x392
> [ 0.455744] [<ffffffff81063c00>] ? kthread_create_on_node+0x130/0x130
> [ 0.455752] [<ffffffff8173a4be>] ret_from_fork+0x4e/0x80
> [ 0.455756] [<ffffffff81063c00>] ? kthread_create_on_node+0x130/0x130
>
>
> So it crashed in the kthread instead of the irq, but exactly the same issue,
> that particular field is not initialized. Not that these aren't patches
> that look like good ideas.

Hmm, so this looks like RT-specific now AFAICT.

mce_notify_irq() calls mce_notify_work() and on RT_FULL that's trying to
wake up mce_notify_helper which is not initialized yet -
mce_notify_work_init() happens later in a device_initcall_sync.

Would something as trivial as this work in your case?

---
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index aaf4b9b94f38..cc70d98a30f6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1391,7 +1391,8 @@ static int mce_notify_work_init(void)
 
 static void mce_notify_work(void)
 {
-	wake_up_process(mce_notify_helper);
+	if (mce_notify_helper)
+		wake_up_process(mce_notify_helper);
 }
 #else
 static void mce_notify_work(void)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.