On Feb 18, 2025, at 01:12, Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Mon, Feb 17, 2025 at 12:13 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>>
>> On Sat, Feb 15, 2025 at 7:24 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> On Fri, 14 Feb 2025 10:11:19 -0800 syzbot <syzbot+38a0cbd267eff2d286ff@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>>> syzbot has found a reproducer for the following issue on:
>>>
>>> Thanks. I doubt if bcachefs is implicated in this?
>>>
>>>> HEAD commit: 128c8f96eb86 Merge tag 'drm-fixes-2025-02-14' of https://g..
>>>> git tree: upstream
>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=148019a4580000
>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=c776e555cfbdb82d
>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=38a0cbd267eff2d286ff
>>>> compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
>>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12328bf8580000
>>>>
>>>> Downloadable assets:
>>>> disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7feb34a89c2a/non_bootable_disk-128c8f96.raw.xz
>>>> vmlinux: https://storage.googleapis.com/syzbot-assets/a97f78ac821e/vmlinux-128c8f96.xz
>>>> kernel image: https://storage.googleapis.com/syzbot-assets/f451cf16fc9f/bzImage-128c8f96.xz
>>>> mounted in repro: https://storage.googleapis.com/syzbot-assets/a7da783f97cf/mount_3.gz
>>>>
>>>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>>>> Reported-by: syzbot+38a0cbd267eff2d286ff@xxxxxxxxxxxxxxxxxxxxxxxxx
>>>>
>>>> ------------[ cut here ]------------
>>>> WARNING: CPU: 0 PID: 5459 at mm/list_lru.c:96 lock_list_lru_of_memcg+0x39e/0x4d0 mm/list_lru.c:96
>>>
>>> VM_WARN_ON(!css_is_dying(&memcg->css));
>>
>> I'm checking this, when last time this was triggered, it was caused by
>> a list_lru user did not initialize the memcg list_lru properly before
>> list_lru reclaim started, and fixed by:
>> https://lore.kernel.org/all/20241222122936.67501-1-ryncsn@xxxxxxxxx/T/
>>
>> This shouldn't be a big issue, maybe there are leaks that will be
>> fixed upon reparenting, and this new added sanity check might be too
>> lenient, I'm not 100% sure though.
>>
>> Unfortunately I couldn't reproduce the issue locally with the
>> reproducer yet. will keep the test running and see if it can hit this
>> WARN_ON.
>
> So far I am still unable to trigger this VM_WARN_ON using the
> reproducer, and I'm seeing many other random crashes.
>
> But after I changed the .config a bit adding more debug configs
> (SLAB_FREELIST_HARDENED, DEBUG_PAGEALLOC), following crash showed up
> and will be triggered immediately after I start the test:
>
> [ T1242] BUG: unable to handle page fault for address: ffff888054c60000
> [ T1242] #PF: supervisor read access in kernel mode
> [ T1242] #PF: error_code(0x0000) - not-present page
> [ T1242] PGD 19e01067 P4D 19e01067 PUD 19e04067 PMD 7fc5c067 PTE
> 800fffffab39f060
> [ T1242] Oops: Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
> [ T1242] CPU: 1 UID: 0 PID: 1242 Comm: kworker/1:1H Not tainted
> 6.14.0-rc2-00185-g128c8f96eb86 #2
> [ T1242] Hardware name: Red Hat KVM/RHEL-AV, BIOS
> 1.16.0-4.module+el8.8.0+664+0a3d6c83 04/01/2014
> [ T1242] Workqueue: bcachefs_btree_read_complete btree_node_read_work
> [ T1242] RIP: 0010:validate_bset_keys+0xae3/0x14f0
> [ T6058] bcachefs (loop2): empty btree root xattrs
> [ T1242] Code: 49 39 df 0f 87 fc 09 00 00 e8 79 54 a8 fd 41 0f b7 c6
> 48 8b 4c 24 68 48 8d 04 c1 4c 29 f8 48 c1 e8 03 89 c1 48 89 de 4c 89
> ff <f3> 48 a5 48 8b bc 24 c8 00 00 08
> [ T1242] RSP: 0018:ffffc900070a72c0 EFLAGS: 00010206
> [ T1242] RAX: 000000000000ec0f RBX: ffff888054c20110 RCX: 0000000000006c31
> [ T1242] RDX: 0000000000000000 RSI: ffff888054c60000 RDI: ffff888054c5ff90
> [ T1242] RBP: ffffc900070a7570 R08: ffff888065e001af R09: 1ffff1100cbc0035
> [ T1242] R10: dffffc0000000000 R11: ffffed100cbc0036 R12: ffff888054c2009e
> [ T1242] R13: dffffc0000000000 R14: 000000000000ec0f R15: ffff888054c200a0
> [ T1242] FS: 0000000000000000(0000) GS:ffff88807ea00000(0000)
> knlGS:0000000000000000
> [ T1242] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ T1242] CR2: ffff888054c60000 CR3: 000000006cea6000 CR4: 00000000000006f0
> [ T1242] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ T1242] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ T1242] Call Trace:
> [ T1242] <TASK>
> [ T1242] bch2_btree_node_read_done+0x1d20/0x53a0
> [ T1242] btree_node_read_work+0x54d/0xdc0
> [ T1242] process_scheduled_works+0xaf8/0x17f0
> [ T1242] worker_thread+0x89d/0xd60
> [ T1242] kthread+0x722/0x890
> [ T1242] ret_from_fork+0x4e/0x80
> [ T1242] ret_from_fork_asm+0x1a/0x30
> [ T1242] </TASK>
> [ T1242] Modules linked in:
> [ T1242] ---[ end trace 0000000000000000 ]---
> [ T1242] RIP: 0010:validate_bset_keys+0xae3/0x14f0
> [ T1242] Code: 49 39 df 0f 87 fc 09 00 00 e8 79 54 a8 fd 41 0f b7 c6
> 48 8b 4c 24 68 48 8d 04 c1 4c 29 f8 48 c1 e8 03 89 c1 48 89 de 4c 89
> ff <f3> 48 a5 48 8b bc 24 c8 00 00 08
> [ T1242] RSP: 0018:ffffc900070a72c0 EFLAGS: 00010206
> [ T1242] RAX: 000000000000ec0f RBX: ffff888054c20110 RCX: 0000000000006c31
> [ T1242] RDX: 0000000000000000 RSI: ffff888054c60000 RDI: ffff888054c5ff90
> [ T1242] RBP: ffffc900070a7570 R08: ffff888065e001af R09: 1ffff1100cbc0035
> [ T1242] R10: dffffc0000000000 R11: ffffed100cbc0036 R12: ffff888054c2009e
> [ T1242] R13: dffffc0000000000 R14: 000000000000ec0f R15: ffff888054c200a0
> [ T1242] FS: 0000000000000000(0000) GS:ffff88807ea00000(0000)
> knlGS:0000000000000000
> [ T1242] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ T1242] CR2: ffff888054c60000 CR3: 000000006cea6000 CR4: 00000000000006f0
> [ T1242] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ T1242] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ T1242] Kernel panic - not syncing: Fatal exception
> [ T1242] Kernel Offset: disabled
> [ T1242] Rebooting in 86400 seconds..
>
> It's caused by the memmove_u64s_down in validate_bset_keys of
> fs/bcachefs/btree_io.c:
> -> memmove_u64s_down(k, bkey_p_next(k), (u64 *) vstruct_end(i) - (u64 *) k);

Might need this.

diff --git a/fs/bcachefs/btree_io.c b/fs/bcachefs/btree_io.c
index e71b278672b6..fb53174cb735 100644
--- a/fs/bcachefs/btree_io.c
+++ b/fs/bcachefs/btree_io.c
@@ -997,7 +997,7 @@ static int validate_bset_keys(struct bch_fs *c, struct btree *b,
 		}
 got_good_key:
 		le16_add_cpu(&i->u64s, -next_good_key);
-		memmove_u64s_down(k, bkey_p_next(k), (u64 *) vstruct_end(i) - (u64 *) k);
+		memmove_u64s_down(k, bkey_p_next(k), (u64 *) vstruct_end(i) - (u64 *) bkey_p_next(k));
 		set_btree_node_need_rewrite(b);
 	}
 fsck_err:

>
> The bkey_p_next(k) is RSI: ffff888054c60000 and it's causing an out of
> border access.
> (u64 *) vstruct_end(i) - (u64 *) k is RCX: 0000000000006c31, if added
> to RDI this should cause an out of border write as well.
>
> This seems to indicate there is an out of border memory modification?
> And maybe it corrupted other subsystems? The slight change to .config
> changed the layout so it's causing a fault, maybe previously this just
> went on silently.
> I don't know much about bcachefs, will be grateful if bcachefs people
> could help have a look.
>