Re: [PATCH][RT] x86: Fix an RT MCE crash

On 06/30/2016 01:22 PM, Borislav Petkov wrote:
On Thu, Jun 30, 2016 at 12:54:14PM -0500, Corey Minyard wrote:
It won't crash.  If you disable PREEMPT_RT on the 3.10-rt kernel it won't
crash (which I have tested).  With PREEMPT_RT, the kernel creates a
separate thread that is woken on mce notifications.  The trouble is
that the interrupts are initialized before the thread is created.
Hmmm.

Ok, so I don't have any idea what RT does but from looking at your splat:

[    0.164153] Call Trace:
[    0.164165]  <IRQ>
[    0.164185]  [<ffffffff8106dcd8>] try_to_wake_up+0x28/0x320
[    0.164188]  [<ffffffff8106dfe0>] wake_up_process+0x10/0x20
[    0.164207]  [<ffffffff8101c548>] mce_notify_irq+0x28/0x30
[    0.164210]  [<ffffffff8101df35>] intel_threshold_interrupt+0xb5/0xd0
[    0.164213]  [<ffffffff8101e88c>] smp_threshold_interrupt+0x1c/0x40
[    0.164221]  [<ffffffff816f9b5a>] threshold_interrupt+0x6a/0x70
[    0.164223]  <EOI>
[    0.164226]  [<ffffffff8101dda7>] ? cmci_recheck+0x67/0x70
[    0.164241]  [<ffffffff816e9777>] setup_local_APIC+0x276/0x283
[    0.164259]  [<ffffffff81caf010>] native_smp_prepare_cpus+0x379/0x43b
[    0.164266]  [<ffffffff81ca3e4f>] kernel_init_freeable+0xd7/0x21a
[    0.164270]  [<ffffffff816df1f0>] ? rest_init+0x90/0x90
[    0.164272]  [<ffffffff816df1f9>] kernel_init+0x9/0x180
[    0.164275]  [<ffffffff816f8dc8>] ret_from_fork+0x58/0x90
[    0.164277]  [<ffffffff816df1f0>] ? rest_init+0x90/0x90
[    0.164295] Code: e7 ff ff 48 8b 7d 08 e8 02 1a 95 ff 5d c3 55 48 89 e5 41
54 53 48 89 fb 9c 41 5c fa bf 01 00 00 00 e8 a8 38 00 00 ba 00 01 00 00 <f0>
66 0f c1 13 0f b6 ce 38 d1 74 10 0f 1f 80 00 00 00 00 f3 90
[    0.164298] RIP  [<ffffffff816f344d>] _raw_spin_lock_irqsave+0x1d/0x50
[    0.164298]  RSP <ffff88017fa03f00>
[    0.164299] CR2: 0000000000000600
[    0.656225] ---[ end trace 0000000000000001 ]---
[    0.656233] Kernel panic - not syncing: Fatal exception in interrupt
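
One possible reading of the fault address in the trace above: CR2 is 0x600 and the faulting instruction is inside _raw_spin_lock_irqsave, called from try_to_wake_up. That pattern is what you would expect if wake_up_process() was handed a NULL task pointer and the first lock taken sits 0x600 bytes into the task structure. The sketch below only illustrates the arithmetic. The struct, its padding, and the 0x600 offset are assumptions made to match the splat, not the real struct task_struct layout of the 3.10-rt kernel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a task structure. The real layout depends on
 * kernel version and config; the padding here is chosen only so that the
 * lock lands at offset 0x600, the value seen in CR2.
 */
struct task_sketch {
    unsigned char other_fields[0x600];
    int pi_lock; /* stand-in for the lock taken first on the wakeup path */
};

/* Address that would be touched when locking task->pi_lock. */
static uintptr_t lock_address(uintptr_t task_base)
{
    return task_base + offsetof(struct task_sketch, pi_lock);
}
```

With a NULL task (base 0), lock_address(0) is just the field offset, so the faulting address is a small number rather than a kernel pointer.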

we're 0.16 seconds into boot, just initializing the local APIC, and the
moment that happens we get a thresholding APIC interrupt.

So how can interrupts be initialized before that?

I don't think they are.  I think there is something about this
particular board.  We aren't having any issues with other systems.

But as you say, the kernel should be ready for this.
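
A minimal userspace sketch of what "ready for this" could look like, assuming the crash comes from waking a helper thread that has not been created yet. This is not the patch under discussion, whose contents are not shown in this message. All names here (task_sketch, mce_notify_helper_sketch, create_helper_sketch, mce_notify_sketch) are hypothetical stand-ins, not kernel symbols.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the RT notification thread. */
struct task_sketch {
    int wakeups;
};

static struct task_sketch helper_storage;

/* NULL until the helper is created later in boot. */
static struct task_sketch *mce_notify_helper_sketch;

/* Stand-in for thread creation, which runs after interrupts are live. */
static void create_helper_sketch(void)
{
    mce_notify_helper_sketch = &helper_storage;
}

/* Stand-in for wake_up_process(); dereferences its argument. */
static void wake_up_sketch(struct task_sketch *t)
{
    t->wakeups++;
}

/*
 * Notification path that tolerates an early interrupt: if the helper
 * does not exist yet, skip the wakeup instead of dereferencing NULL.
 * Returns true if a wakeup was delivered.
 */
static bool mce_notify_sketch(void)
{
    struct task_sketch *t = mce_notify_helper_sketch;

    if (!t)
        return false;
    wake_up_sketch(t);
    return true;
}
```

Calling mce_notify_sketch() before create_helper_sketch() returns false and touches nothing; after creation it delivers the wakeup. A real fix would also have to decide whether an event seen this early needs to be replayed once the thread exists, which this sketch does not address.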


I'm genuinely asking because I can't imagine how CMCI can get initialized
*after* the local APIC init.

Because we do init CMCI in identify_cpu()->mcheck_cpu_init(), and that
happens earlier than your splat. You can even see where it happens in
dmesg:

[    0.049270] mce: CPU supports 22 MCE banks
[    0.049383] CPU0: Thermal monitoring enabled (TM1)

First line is __mcheck_cpu_cap_init(), second is intel_init_thermal().

The CMCI initialization is done right after it in

void mce_intel_feature_init(struct cpuinfo_x86 *c)
{
         intel_init_thermal(c);
         intel_init_cmci();


But wait, this is the upstream kernel. Where can I look at the 3.10-rt
sources?

They are at:

git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git v3.10-rt

-corey
