Re: kernel BUG at lib/list_debug.c:54! RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47

Tejun Heo <tj@xxxxxxxxxx> · Tue, 22 Feb 2022 10:15:34 -1000

Hello,

On Wed, Feb 09, 2022 at 10:23:18AM -0700, Chris Murphy wrote:
> I hit this bug out of the blue (haven't seen it before) with 5.16.5,
> the activity at the time was logging out of GNOME shell, and dropping
> to a tty, and then got a hard lockup. And cgwb_release_workfn brought
> me here, let me know if it should go elsewhere.
> 
> [35824.733029] kernel: list_del corruption. next->prev should be
> ffff93e01fa2f550, but was 0000000000000000
> [35824.733085] kernel: ------------[ cut here ]------------
> [35824.733104] kernel: kernel BUG at lib/list_debug.c:54!
> [35824.733127] kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [35824.733149] kernel: CPU: 1 PID: 27905 Comm: kworker/1:2 Not tainted
> 5.16.5-200.fc35.x86_64 #1
> [35824.733179] kernel: Hardware name: LENOVO 20QDS3E200/20QDS3E200,
> BIOS N2HET66W (1.49 ) 11/10/2021
> [35824.733208] kernel: Workqueue: cgwb_release cgwb_release_workfn
> [35824.733234] kernel: RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
> [35824.733260] kernel: Code: c7 c7 38 a8 64 91 e8 47 d8 fd ff 0f 0b 48
> 89 fe 48 c7 c7 c8 a8 64 91 e8 36 d8 fd ff 0f 0b 48 c7 c7 78 a9 64 91
> e8 28 d8 fd ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 38 a9 64 91 e8 14 d8
> fd ff 0f 0b
> [35824.733322] kernel: RSP: 0018:ffffa710470ffe40 EFLAGS: 00010082
> [35824.733343] kernel: RAX: 0000000000000054 RBX: ffff93e01fa2f540
> RCX: 0000000000000000
> [35824.733370] kernel: RDX: 0000000000000002 RSI: ffffffff91634c5d
> RDI: 00000000ffffffff
> [35824.733396] kernel: RBP: 0000000000000202 R08: 0000000000000000
> R09: ffffa710470ffc88
> [35824.733423] kernel: R10: ffffa710470ffc80 R11: ffffffff91f462a8
> R12: 00000000ffffffff
> [35824.733449] kernel: R13: ffff93e0092f1000 R14: ffff93e01fa2f400
> R15: ffff93e36e879b05
> [35824.733475] kernel: FS:  0000000000000000(0000)
> GS:ffff93e36e840000(0000) knlGS:0000000000000000
> [35824.733505] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [35824.733527] kernel: CR2: 00007fcbd0ef9e40 CR3: 0000000057e10001
> CR4: 00000000003726e0
> [35824.733553] kernel: Call Trace:
> [35824.733566] kernel:  <TASK>
> [35824.733577] kernel:  percpu_counter_destroy+0x24/0x80
> [35824.733599] kernel:  cgwb_release_workfn+0xf9/0x210
> [35824.733619] kernel:  process_one_work+0x1e5/0x3c0
> [35824.733639] kernel:  worker_thread+0x50/0x3b0
> [35824.733656] kernel:  ? rescuer_thread+0x350/0x350
> [35824.733674] kernel:  kthread+0x169/0x190
> [35824.733704] kernel:  ? set_kthread_struct+0x40/0x40
> [35824.733725] kernel:  ret_from_fork+0x1f/0x30
> [35824.733747] kernel:  </TASK>

It's difficult to tell with the available information. I'd be surprised if
it's a bug in the cgwb release path itself, given that all the prior steps in
the release path ran fine; e.g. if it were a double free, it should have
triggered earlier. One possibility is that something is overwriting the list
pointers through a use-after-free or similar memory corruption. The best way
forward would be to find a way to reproduce the problem.
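
For reference, the kind of check that fired here can be sketched in userspace C
roughly as below. This is a simplified illustration under my own assumptions,
not the kernel's code in lib/list_debug.c: list_init() and
list_del_entry_valid() are stand-in helpers, and the real check also looks for
poison values and prints the "list_del corruption" message seen in the log.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's doubly linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry immediately after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Hypothetical, simplified validity check: before unlinking, both
 * neighbours must still point back at the entry. If some other code
 * scribbled over a neighbour (e.g. zeroed its prev pointer after a
 * use-after-free), this returns false instead of unlinking.
 */
static bool list_del_entry_valid(const struct list_head *entry)
{
    if (entry->next == NULL || entry->prev == NULL)
        return false;
    if (entry->prev->next != entry)
        return false;
    if (entry->next->prev != entry)
        return false;
    return true;
}
```

The point of the sketch is that the entry being deleted can be perfectly fine;
the check fails because a neighbouring node's prev pointer no longer points
back at it, which is what "next->prev should be ..., but was 0000000000000000"
reports.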

Thanks.

-- 
tejun