+ Suren for VMA lock aspects.

On Mon, Sep 23, 2024 at 10:44:34AM GMT, Lorenzo Stoakes wrote:
> On Mon, Sep 23, 2024 at 02:04:23AM GMT, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:    88264981f208 Merge tag 'sched_ext-for-6.12' of git://git.k..
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1237ec27980000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=e6702f5f2b8ed242
> > dashboard link: https://syzkaller.appspot.com/bug?extid=e01fa33e67abb0b3b3bb
> > compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
>
> Thanks for the report, investigating.

TL;DR: I think there's a race, but it's in code that detects and handles
races, so we probably just need to add a data_race() or similar annotation.

I'll leave it to Liam (who is currently on leave) to address annotating
this if appropriate on his return, unless this becomes more urgent in the
meantime.

> > Unfortunately, I don't have any reproducer for this issue yet.
>
> I suspect given this is so timing-specific, a reproducer might be
> difficult.
> > Downloadable assets:
> > disk image: https://storage.googleapis.com/syzbot-assets/95bba355b2ed/disk-88264981.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/75966f4e5286/vmlinux-88264981.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/7f1578876250/bzImage-88264981.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+e01fa33e67abb0b3b3bb@xxxxxxxxxxxxxxxxxxxxxxxxx
> >
> > ==================================================================
> > BUG: KCSAN: data-race in mas_wr_store_entry / mtree_range_walk
> >
> > write to 0xffff888114555710 of 8 bytes by task 9573 on cpu 1:

This is:

	wr_mas->pivots[offset] = mas->index - 1;

> >  mas_wr_slot_store lib/maple_tree.c:3889 [inline]
> >  mas_wr_store_entry+0x146b/0x2d00 lib/maple_tree.c:4075
> >  mas_store_prealloc+0x6bf/0x960 lib/maple_tree.c:5520
> >  vma_iter_store mm/vma.h:470 [inline]
> >  commit_merge+0x441/0x740 mm/vma.c:609
> >  vma_expand+0x211/0x360 mm/vma.c:1024
> >  vma_merge_new_range+0x2cf/0x3e0 mm/vma.c:963
> >  mmap_region+0x887/0x16e0 mm/mmap.c:1416
> >  do_mmap+0x718/0xb60 mm/mmap.c:496
> >  vm_mmap_pgoff+0x133/0x290 mm/util.c:588
> >  ksys_mmap_pgoff+0xd0/0x330 mm/mmap.c:542
> >  x64_sys_call+0x1884/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:10
> >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >  do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
> >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > read to 0xffff888114555710 of 8 bytes by task 9574 on cpu 0:
> >  mtree_range_walk+0x1b4/0x460 lib/maple_tree.c:2779

This is:

	if (pivots[offset] >= mas->index) {

So we appear to be racing on this. But... I don't think there's an actual
race issue here, just maybe a need for a data_race() wrapper or similar.

The lock_vma_under_rcu() call is an optimistic attempt to obtain a VMA, and
will try again if it is unable to get a stable reference to a VMA.
On vma_start_write(), which the writing stack above invokes prior to
writing to the maple tree, we lock the VMA for writing and set
vma->vm_lock_seq equal to vma->vm_mm->mm_lock_seq before unlocking. So we
know that vma->vm_lock_seq == vma->vm_mm->mm_lock_seq right up until
vma_end_write_all() is called after the write is done (via
mmap_write_unlock()).

Then, in lock_vma_under_rcu(), vma_start_read() is called, which bails out
if vma->vm_lock_seq == vma->vm_mm->mm_lock_seq, which it likely would be.

If you were _very_ unlucky with timing you might still get the lock on the
modified VMA. However, if it was detached by then, it would have been
marked detached and lock_vma_under_rcu() will bail out. Finally, if we got
the VMA and it changed (e.g. the expand above), we explicitly check that
the address is still in range; if it is, we're perfectly good to use it.

If lock_vma_under_rcu() bails, it resorts to the slow path: no harm, no
foul.

I don't know why this was triggered by recent changes, maybe a timing
thing, but I don't _think_ we have an issue here; no part of this code
path should really behave differently. If we have further reports in this
area, or any indication of a deeper issue, I'll dive back in.

> >  mas_state_walk lib/maple_tree.c:3601 [inline]
> >  mas_walk+0x16e/0x320 lib/maple_tree.c:4948
> >  lock_vma_under_rcu+0x95/0x260 mm/memory.c:6224
> >  do_user_addr_fault arch/x86/mm/fault.c:1329 [inline]
> >  handle_page_fault arch/x86/mm/fault.c:1481 [inline]
> >  exc_page_fault+0x150/0x650 arch/x86/mm/fault.c:1539
> >  asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:623
> >
> > value changed: 0x00007f8311576fff -> 0xffffffff8529a680
>
> This suggests we are failing to acquire an RCU lock on mmap() (though we
> have the write mmap lock).
>
> Maybe we missed an RCU lock at some point, but I'm a little baffled as to
> what could have changed in recent series to adjust this.
>
> I will dig into this and see what's going on.
> > Reported by Kernel Concurrency Sanitizer on:
> > CPU: 0 UID: 0 PID: 9574 Comm: syz.0.2084 Tainted: G        W    6.11.0-syzkaller-08481-g88264981f208 #0
> > Tainted: [W]=WARN
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
> > ==================================================================
> >
> >
> > ---
> > This report is generated by a bot. It may contain errors.
> > See https://goo.gl/tpsmEJ for more information about syzbot.
> > syzbot engineers can be reached at syzkaller@xxxxxxxxxxxxxxxx.
> >
> > syzbot will keep track of this issue. See:
> > https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> >
> > If the report is already addressed, let syzbot know by replying with:
> > #syz fix: exact-commit-title
> >
> > If you want to overwrite report's subsystems, reply with:
> > #syz set subsystems: new-subsystem
> > (See the list of subsystem names on the web dashboard)
> >
> > If the report is a duplicate of another one, reply with:
> > #syz dup: exact-subject-of-another-report
> >
> > If you want to undo deduplication, reply with:
> > #syz undup