Re: [syzbot] [mm?] [bcachefs?] KASAN: slab-out-of-bounds Read in folio_try_get

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Fri, 14 Feb 2025 15:57:08 -0500

On Fri, Feb 14, 2025 at 08:34:44PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 14, 2025 at 11:59:27AM -0800, syzbot wrote:
> > BUG: KASAN: slab-out-of-bounds in instrument_atomic_read include/linux/instrumented.h:68 [inline]
> > BUG: KASAN: slab-out-of-bounds in atomic_read include/linux/atomic/atomic-instrumented.h:32 [inline]
> > BUG: KASAN: slab-out-of-bounds in page_ref_count include/linux/page_ref.h:67 [inline]
> > BUG: KASAN: slab-out-of-bounds in page_ref_add_unless include/linux/page_ref.h:237 [inline]
> > BUG: KASAN: slab-out-of-bounds in folio_ref_add_unless include/linux/page_ref.h:248 [inline]
> > BUG: KASAN: slab-out-of-bounds in folio_try_get+0xde/0x350 include/linux/page_ref.h:264
> > Read of size 4 at addr ffff88804f904b34 by task syz-executor127/5388
> > 
> > CPU: 0 UID: 0 PID: 5388 Comm: syz-executor127 Not tainted 6.14.0-rc2-syzkaller-00056-gab68d7eb7b1a #0
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> > Call Trace:
> >  <TASK>
> >  __dump_stack lib/dump_stack.c:94 [inline]
> >  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
> >  print_address_description mm/kasan/report.c:378 [inline]
> >  print_report+0x169/0x550 mm/kasan/report.c:489
> >  kasan_report+0x143/0x180 mm/kasan/report.c:602
> >  kasan_check_range+0x282/0x290 mm/kasan/generic.c:189
> >  instrument_atomic_read include/linux/instrumented.h:68 [inline]
> >  atomic_read include/linux/atomic/atomic-instrumented.h:32 [inline]
> >  page_ref_count include/linux/page_ref.h:67 [inline]
> >  page_ref_add_unless include/linux/page_ref.h:237 [inline]
> >  folio_ref_add_unless include/linux/page_ref.h:248 [inline]
> >  folio_try_get+0xde/0x350 include/linux/page_ref.h:264
> >  filemap_get_entry+0x240/0x3b0 mm/filemap.c:1870
> >  shmem_get_folio_gfp+0x285/0x1840 mm/shmem.c:2446
> >  shmem_get_folio mm/shmem.c:2628 [inline]
> >  shmem_write_begin+0x165/0x350 mm/shmem.c:3278
> >  generic_perform_write+0x346/0x990 mm/filemap.c:4189
> >  shmem_file_write_iter+0xf9/0x120 mm/shmem.c:3454
> >  new_sync_write fs/read_write.c:586 [inline]
> >  vfs_write+0xacf/0xd10 fs/read_write.c:679
> >  ksys_write+0x18f/0x2b0 fs/read_write.c:731
> >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > RIP: 0033:0x7fb60d00ef1f
> > Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 19 81 02 00 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 6c 81 02 00 48
> > RSP: 002b:00007fb60c7b9fb0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> > RAX: ffffffffffffffda RBX: 00007fb60c7b9ff0 RCX: 00007fb60d00ef1f
> > RDX: 0000000001000000 RSI: 00007fb604200000 RDI: 0000000000000003
> > RBP: 00007fb60d0976e0 R08: 0000000000000000 R09: 000000000000590c
> > R10: 0000000000000002 R11: 0000000000000293 R12: 00007fb60d0976ec
> > R13: 00007fb60c7ba030 R14: 0000000000000003 R15: 00007ffe9f1d73d8
> >  </TASK>
> > 
> > The buggy address belongs to the object at ffff88804f904b00
> >  which belongs to the cache radix_tree_node of size 576
> > The buggy address is located 52 bytes inside of
> >  allocated 576-byte region [ffff88804f904b00, ffff88804f904d40)
> 
> Wait, what?  We're calling folio_try_get() on a pointer which isn't a
> pointer to a folio, but a pointer to somewhere in a radix_tree_node?
> 
> This fits a pattern we're seeing a lot of recently.
> Bugs:
> 
> https://syzkaller.appspot.com/bug?extid=b581c7106aa616bb522c
> https://syzkaller.appspot.com/bug?extid=8ae0902c29b15a27a4ee
> https://syzkaller.appspot.com/bug?extid=07392c132f11b1758ac3
> https://syzkaller.appspot.com/bug?extid=fe375f77ba1a6ab944b6
> https://syzkaller.appspot.com/bug?extid=a0ae55e3dde11d2d790c
> 
> They all fit the form of syzbot mounts a (potentially fuzzed?) bcachefs
> file system and later we have a corruption in the radix tree.
> 
> I have two suspicions (feel free to assign your own probabilities to which
> is correct).  The first is that a bunch of tweaky little cleanups went
> into the xarray code in the last merge window.  I really wish we could
> stop doing that kind of bullshit.  Can we just agree that the xarray
> code is good enough and not keep pissing with it?  Obviously if there's
> a bug, then we should fix it (and that should come with test cases!),
> but otherwise just leave it alone.  Please.  It would make finding this
> kind of problem much easier.

Oof. Sounds like an insufficient test suite if that stuff is making it
in? You should be able to just tell people "no more xarray patches until
the testing is improved".

> The second is that bcachefs has a random memory stomper.  That would
> suck.  Kent, you said you had some automated tooling to feed syzbot
> reproducers into?

Yeah, although on 6.14-rc1 the builds fail with the kconfig that syzbot
used, and the bisect landed on a kbuild merge commit. Fun. It's going to
take some time digging through cpp -E output to figure out what's going
on.

btk -IP ~/ktest/tests/syzbot-repro.ktest <syz bug id>

(I really want the syzbot reproducers to run with a slimmed down ktest
generated .config, but syzbot reproducers have all sorts of random
dependencies so that's been a process).

I don't have any reason to suspect bcachefs has a memory stomper that
could cause this, but given the number of syzbot bugs I still have to
get through I can't rule it out. Unfortunately, it's looking like it'll
be a few weeks minimum before I can really start getting to the syzbot
bugs.