Re: panic at boot time with kernel >= 4.9.98 - uninitialized system_wq in early interrupt

Greg KH <greg@xxxxxxxxx> · Thu, 7 Jun 2018 10:37:14 +0200

On Wed, Jun 06, 2018 at 11:26:46PM +0000, Max Asbock wrote:
> 
> We have been seeing kernel panics on certain systems with the 4.9.x stable kernel, where x >= 98.
> We captured the following panic message:
> [    6.252000] BUG: unable to handle kernel NULL pointer dereference at 0000000000000102
> [    6.261222] IP: [<ffffffff810a8ea2>] __queue_work+0x32/0x430
> [    6.267826] PGD 0 [    6.269951]
> [    6.271887] Oops: 0000 [#1] SMP
> [    6.275576] Modules linked in:
> [    6.279264] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.100 #1
> [    6.296760] task: ffffffff81c114c0 task.stack: ffffffff81c00000
> [    6.303564] RIP: 0010:[<ffffffff810a8ea2>]  [<ffffffff810a8ea2>] __queue_work+0x32/0x430
> [    6.312997] RSP: 0000:ffff897a40403d98  EFLAGS: 00010046
> [    6.319120] RAX: 0000000000000082 RBX: 0000000000000046 RCX: 0000000000000000
> [    6.327282] RDX: ffffffff81d02300 RSI: 0000000000000000 RDI: 0000000000002000
> [    6.335443] RBP: ffff897a40403dd0 R08: 00000000d431dd4b R09: 0000000000000000
> [    6.343606] R10: ffff897a40403e08 R11: ffffffff8227615c R12: ffffffff81d02300
> [    6.351766] R13: 0000000000002000 R14: 0000000000000000 R15: ffffffff81a88ffd
> [    6.359930] FS:  0000000000000000(0000) GS:ffff897a40400000(0000) knlGS:0000000000000000
> [    6.369274] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    6.375883] CR2: 0000000000000102 CR3: 0000000001c08000 CR4: 0000000000040630
> [    6.384045] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    6.392206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    6.400366] Stack:
> [    6.402801]  0000001000000000 0000200040403e08 0000000000000046 0000000000000002
> [    6.411723]  ffffffff82275f40 ffff897a40403e08 ffffffff81a88ffd ffff897a40403de8
> [    6.420638]  ffffffff810a9677 ffffffff82275f88 ffff897a40403e78 ffffffff8148dff4
> [    6.429554] Call Trace:
> [    6.432473]  <IRQ> [    6.434698]  [<ffffffff810a9677>] queue_work_on+0x27/0x40
> [    6.441121]  [<ffffffff8148dff4>] crng_reseed+0x154/0x260
> [    6.447343]  [<ffffffff8148e334>] credit_entropy_bits+0x234/0x2a0
> [    6.454343]  [<ffffffff8148e77b>] ? add_interrupt_randomness+0x1bb/0x220
> [    6.462020]  [<ffffffff8148e77b>] add_interrupt_randomness+0x1bb/0x220
> [    6.469506]  [<ffffffff810ef8f0>] handle_irq_event_percpu+0x40/0x80
> [    6.476698]  [<ffffffff810ef96b>] handle_irq_event+0x3b/0x60
> [    6.483211]  [<ffffffff810f2e7f>] handle_level_irq+0x8f/0x110
> [    6.489823]  [<ffffffff81030fd5>] handle_irq+0xb5/0x140
> [    6.495854]  [<ffffffff81096651>] ? _local_bh_enable+0x21/0x50
> [    6.502562]  [<ffffffff81038ab5>] ? __exit_idle+0x5/0x30
> [    6.508689]  [<ffffffff817250cd>] do_IRQ+0x4d/0xe0
> [    6.514232]  [<ffffffff81722ba0>] common_interrupt+0xa0/0xa0
> [    6.520744]  <EOI> [    6.522968]  [<ffffffff81ddeb08>] ? vfs_caches_init+0xe1/0xe3
> [    6.529779]  [<ffffffff81da212e>] start_kernel+0x451/0x4c8
> [    6.536096]  [<ffffffff81da1acb>] ? set_init_arg+0x55/0x55
> [    6.542415]  [<ffffffff81da1120>] ? early_idt_handler_array+0x120/0x120
> [    6.549995]  [<ffffffff81da15d6>] x86_64_start_reservations+0x2a/0x2c
> [    6.557381]  [<ffffffff81da1714>] x86_64_start_kernel+0x13c/0x15f
> [    6.564378] Code: 89 e5 41 57 41 56 49 89 f6 41 55 41 89 fd 41 54 49 89 d4 53 48 83 ec 10 89 7d d4 ff 14 25 80 86 c3 81 f6 c4 02 0f 85 1f 03 00 00 <41> f6 86 02 01 00 00 01 0f 85 b2 02 00 00 49 c7 c7 78 69 01 00
> [    6.591500] RIP  [<ffffffff810a8ea2>] __queue_work+0x32/0x430
> [    6.598197]  RSP <ffff897a40403d98>
> [    6.602281] CR2: 0000000000000102
> 
> The NULL pointer dereference happens because the second argument to __queue_work is 0. The second argument is a struct workqueue_struct, specifically system_wq. This shows that an interrupt happened before the system_wq was initialized. We believe the problem was introduced with this commit:
> 
> $ git show c3ff2da5cef05676d490fa9057b2dceb5e48cdb9
> commit c3ff2da5cef05676d490fa9057b2dceb5e48cdb9
> Author: Theodore Ts'o <tytso@xxxxxxx>
> Date:   Mon Apr 23 18:51:28 2018 -0400
> 
>     random: fix possible sleeping allocation from irq context
>     
>     commit 6c1e851c4edc13a43adb3ea4044e3fc8f43ccf7d upstream.
>     
>     We can do a sleeping allocation from an irq context when CONFIG_NUMA
>     is enabled.  Fix this by initializing the NUMA crng instances in a
>     workqueue.
>     
>     Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
>     Reported-by: syzbot+9de458f6a5e713ee8c1a@xxxxxxxxxxxxxxxxxxxxxxxxx
>     Fixes: 8ef35c866f8862df ("random: set up the NUMA crng instances...")
>     Cc: stable@xxxxxxxxxxxxxxx
>     Signed-off-by: Theodore Ts'o <tytso@xxxxxxx>
>     Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> 
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index b979173..dbfb3e69 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -820,7 +820,7 @@ static int crng_fast_load(const char *cp, size_t len)
>  }
>  
>  #ifdef CONFIG_NUMA
> -static void numa_crng_init(void)
> +static void do_numa_crng_init(struct work_struct *work)
>  {
>         int i;
>         struct crng_state *crng;
> @@ -841,6 +841,13 @@ static void numa_crng_init(void)
>                 kfree(pool);
>         }
>  }
> +
> +static DECLARE_WORK(numa_crng_init_work, do_numa_crng_init);
> +
> +static void numa_crng_init(void)
> +{
> +       schedule_work(&numa_crng_init_work);
> +}
>  #else
>  static void numa_crng_init(void) {}
>  #endif
> 
> Apparently we can't count on system_wq being initialized when schedule_work is called in numa_crng_init from an early interrupt.
> 
> I don't understand the underlying code enough to propose a fix right away. 

Ick :(

I'm guessing you also see these problems on 4.17?  Can you test there to
be sure of that?

thanks,

greg k-h