panic at boot time with kernel >= 4.9.98 - uninitialized system_wq in early interrupt

Max Asbock <amax@xxxxxxxxxx> · Wed, 6 Jun 2018 23:26:46 +0000

We have been seeing kernel panics on certain systems with the 4.9.x stable kernel, where x >= 98.
We captured the following panic message:
[    6.252000] BUG: unable to handle kernel NULL pointer dereference at 0000000000000102
[    6.261222] IP: [<ffffffff810a8ea2>] __queue_work+0x32/0x430
[    6.267826] PGD 0 [    6.269951]
[    6.271887] Oops: 0000 [#1] SMP
[    6.275576] Modules linked in:
[    6.279264] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.100 #1
[    6.296760] task: ffffffff81c114c0 task.stack: ffffffff81c00000
[    6.303564] RIP: 0010:[<ffffffff810a8ea2>]  [<ffffffff810a8ea2>] __queue_work+0x32/0x430
[    6.312997] RSP: 0000:ffff897a40403d98  EFLAGS: 00010046
[    6.319120] RAX: 0000000000000082 RBX: 0000000000000046 RCX: 0000000000000000
[    6.327282] RDX: ffffffff81d02300 RSI: 0000000000000000 RDI: 0000000000002000
[    6.335443] RBP: ffff897a40403dd0 R08: 00000000d431dd4b R09: 0000000000000000
[    6.343606] R10: ffff897a40403e08 R11: ffffffff8227615c R12: ffffffff81d02300
[    6.351766] R13: 0000000000002000 R14: 0000000000000000 R15: ffffffff81a88ffd
[    6.359930] FS:  0000000000000000(0000) GS:ffff897a40400000(0000) knlGS:0000000000000000
[    6.369274] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.375883] CR2: 0000000000000102 CR3: 0000000001c08000 CR4: 0000000000040630
[    6.384045] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    6.392206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    6.400366] Stack:
[    6.402801]  0000001000000000 0000200040403e08 0000000000000046 0000000000000002
[    6.411723]  ffffffff82275f40 ffff897a40403e08 ffffffff81a88ffd ffff897a40403de8
[    6.420638]  ffffffff810a9677 ffffffff82275f88 ffff897a40403e78 ffffffff8148dff4
[    6.429554] Call Trace:
[    6.432473]  <IRQ> [    6.434698]  [<ffffffff810a9677>] queue_work_on+0x27/0x40
[    6.441121]  [<ffffffff8148dff4>] crng_reseed+0x154/0x260
[    6.447343]  [<ffffffff8148e334>] credit_entropy_bits+0x234/0x2a0
[    6.454343]  [<ffffffff8148e77b>] ? add_interrupt_randomness+0x1bb/0x220
[    6.462020]  [<ffffffff8148e77b>] add_interrupt_randomness+0x1bb/0x220
[    6.469506]  [<ffffffff810ef8f0>] handle_irq_event_percpu+0x40/0x80
[    6.476698]  [<ffffffff810ef96b>] handle_irq_event+0x3b/0x60
[    6.483211]  [<ffffffff810f2e7f>] handle_level_irq+0x8f/0x110
[    6.489823]  [<ffffffff81030fd5>] handle_irq+0xb5/0x140
[    6.495854]  [<ffffffff81096651>] ? _local_bh_enable+0x21/0x50
[    6.502562]  [<ffffffff81038ab5>] ? __exit_idle+0x5/0x30
[    6.508689]  [<ffffffff817250cd>] do_IRQ+0x4d/0xe0
[    6.514232]  [<ffffffff81722ba0>] common_interrupt+0xa0/0xa0
[    6.520744]  <EOI> [    6.522968]  [<ffffffff81ddeb08>] ? vfs_caches_init+0xe1/0xe3
[    6.529779]  [<ffffffff81da212e>] start_kernel+0x451/0x4c8
[    6.536096]  [<ffffffff81da1acb>] ? set_init_arg+0x55/0x55
[    6.542415]  [<ffffffff81da1120>] ? early_idt_handler_array+0x120/0x120
[    6.549995]  [<ffffffff81da15d6>] x86_64_start_reservations+0x2a/0x2c
[    6.557381]  [<ffffffff81da1714>] x86_64_start_kernel+0x13c/0x15f
[    6.564378] Code: 89 e5 41 57 41 56 49 89 f6 41 55 41 89 fd 41 54 49 89 d4 53 48 83 ec 10 89 7d d4 ff 14 25 80 86 c3 81 f6 c4 02 0f 85 1f 03 00 00 <41> f6 86 02 01 00 00 01 0f 85 b2 02 00 00 49 c7 c7 78 69 01 00
[    6.591500] RIP  [<ffffffff810a8ea2>] __queue_work+0x32/0x430
[    6.598197]  RSP <ffff897a40403d98>
[    6.602281] CR2: 0000000000000102

The NULL pointer dereference happens because the second argument to __queue_work is 0. The second argument is a struct workqueue_struct, specifically system_wq. This shows that an interrupt happened before the system_wq was initialized. We believe the problem was introduced with this commit:

$ git show c3ff2da5cef05676d490fa9057b2dceb5e48cdb9
commit c3ff2da5cef05676d490fa9057b2dceb5e48cdb9
Author: Theodore Ts'o <tytso@xxxxxxx>
Date:   Mon Apr 23 18:51:28 2018 -0400

    random: fix possible sleeping allocation from irq context
    
    commit 6c1e851c4edc13a43adb3ea4044e3fc8f43ccf7d upstream.
    
    We can do a sleeping allocation from an irq context when CONFIG_NUMA
    is enabled.  Fix this by initializing the NUMA crng instances in a
    workqueue.
    
    Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
    Reported-by: syzbot+9de458f6a5e713ee8c1a@xxxxxxxxxxxxxxxxxxxxxxxxx
    Fixes: 8ef35c866f8862df ("random: set up the NUMA crng instances...")
    Cc: stable@xxxxxxxxxxxxxxx
    Signed-off-by: Theodore Ts'o <tytso@xxxxxxx>
    Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>

diff --git a/drivers/char/random.c b/drivers/char/random.c
index b979173..dbfb3e69 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -820,7 +820,7 @@ static int crng_fast_load(const char *cp, size_t len)
 }
 
 #ifdef CONFIG_NUMA
-static void numa_crng_init(void)
+static void do_numa_crng_init(struct work_struct *work)
 {
        int i;
        struct crng_state *crng;
@@ -841,6 +841,13 @@ static void numa_crng_init(void)
                kfree(pool);
        }
 }
+
+static DECLARE_WORK(numa_crng_init_work, do_numa_crng_init);
+
+static void numa_crng_init(void)
+{
+       schedule_work(&numa_crng_init_work);
+}
 #else
 static void numa_crng_init(void) {}
 #endif

Apparently we can't count on system_wq being initialized when schedule_work is called in numa_crng_init from an early interrupt.

I don't understand the underlying code enough to propose a fix right away. 

thanks,
Max