Re: [PATCH][RT] x86: Fix an RT MCE crash

Corey Minyard <minyard@xxxxxxx> · Tue, 5 Jul 2016 19:59:59 -0500

On 07/01/2016 02:20 AM, Borislav Petkov wrote:
That sounds like a bit much.
Actually, you probably would need only a couple:

1. 648ed94038c0 ("x86/mce: Provide a lockless memory pool to save error records")

2. 061120aed708 ("x86/mce: Don't use percpu workqueues")
  - that one is unrelated but should be nice for RT as it gets rid of percpu
    workqueues and I know RT hates them :)

3. fd4cf79fcc4b ("x86/mce: Remove the MCE ring for Action Optional errors")
  - this one connects the genpool to MCE

4. f29a7aff4bd6 ("x86/mce: Avoid potential deadlock due to printk() in MCE context")
  - and this is the last one which I meant earlier.

So that's 4 patches, more or less.
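
For reference, a minimal sketch of queuing those four commits for a backport. It assumes a checkout of the target stable tree with mainline available as a remote. The apply order simply follows the list above, which is an assumption, and the helper name is hypothetical. Conflicts and extra prerequisite patches are likely on an older tree.

```python
# Commit IDs and subjects exactly as listed above.
# The apply order is an assumption: it just mirrors the list.
BACKPORTS = [
    ("648ed94038c0", "x86/mce: Provide a lockless memory pool to save error records"),
    ("061120aed708", "x86/mce: Don't use percpu workqueues"),
    ("fd4cf79fcc4b", "x86/mce: Remove the MCE ring for Action Optional errors"),
    ("f29a7aff4bd6", "x86/mce: Avoid potential deadlock due to printk() in MCE context"),
]


def cherry_pick_commands(commits):
    """Hypothetical helper: return one `git cherry-pick -x` line per commit.

    The -x flag records the original commit ID in the backported
    commit message.
    """
    return [f"git cherry-pick -x {sha}" for sha, _subject in commits]


if __name__ == "__main__":
    # Print the commands rather than running them, so each pick can be
    # reviewed and conflicts resolved by hand.
    for cmd in cherry_pick_commands(BACKPORTS):
        print(cmd)
```
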

Now, you're in the perfect position to test those because you *actually*
have a real-life system which generates those errors so it is the
perfect candidate for testing the backports. And you should test them
with the failing DIMM still in place, of course.

I'm having our hardware people keep the system as-is until we can
track this down.

I applied the above four patches, plus a few more support patches that
were needed, but no love.  Exact same issue.  Well, almost the same; here's
the traceback:

[    0.455575]  [<ffffffff810733c4>] try_to_wake_up+0x34/0x300
[    0.455590]  [<ffffffff81067d76>] ? __hrtimer_start_range_ns+0x226/0x3a0
[    0.455593]  [<ffffffff810736e0>] wake_up_process+0x10/0x20
[    0.455615]  [<ffffffff8101c7a8>] mce_notify_irq+0x28/0x30
[    0.455621]  [<ffffffff8101cbd9>] mce_irq_work_cb+0x9/0x10
[    0.455646]  [<ffffffff810cbb0c>] irq_work_run_list+0x3c/0x60
[    0.455649]  [<ffffffff810cbe97>] irq_work_tick_soft+0x27/0x30
[    0.455673]  [<ffffffff8104dbe4>] run_timer_softirq+0x24/0x250
[    0.455681]  [<ffffffff81045bce>] do_current_softirqs+0x1ae/0x250
[    0.455684]  [<ffffffff81045c9e>] run_ksoftirqd+0x2e/0x50
[    0.455697]  [<ffffffff8106c7f6>] smpboot_thread_fn+0x206/0x320
[    0.455700]  [<ffffffff8106c5f0>] ? lg_global_unlock+0x60/0x60
[    0.455720]  [<ffffffff81063cad>] kthread+0xad/0xc0
[    0.455740]  [<ffffffff81730303>] ? _dbgp_external_startup+0x236/0x392
[    0.455744]  [<ffffffff81063c00>] ? kthread_create_on_node+0x130/0x130
[    0.455752]  [<ffffffff8173a4be>] ret_from_fork+0x4e/0x80
[    0.455756]  [<ffffffff81063c00>] ? kthread_create_on_node+0x130/0x130

So it crashed in the kthread instead of the irq, but it's exactly the same
issue: that particular field is not initialized.  Not that these patches
don't look like good ideas.

-corey