On Wed, Jan 31, 2024 at 10:37:18AM +1100, Dave Chinner wrote: > On Tue, Jan 30, 2024 at 06:52:21AM -0800, syzbot wrote: > > syzbot has found a reproducer for the following issue on: > > > > HEAD commit: 861c0981648f Merge tag 'jfs-6.8-rc3' of github.com:kleikam.. > > git tree: upstream > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=13ca8d97e80000 > > kernel config: https://syzkaller.appspot.com/x/.config?x=b0b9993d7d6d1990 > > dashboard link: https://syzkaller.appspot.com/bug?extid=cdee56dbcdf0096ef605 > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=104393efe80000 > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1393b90fe80000 > > > > Downloadable assets: > > disk image: https://storage.googleapis.com/syzbot-assets/7c6cc521298d/disk-861c0981.raw.xz > > vmlinux: https://storage.googleapis.com/syzbot-assets/6203c94955db/vmlinux-861c0981.xz > > kernel image: https://storage.googleapis.com/syzbot-assets/17e76e12b58c/bzImage-861c0981.xz > > mounted in repro: https://storage.googleapis.com/syzbot-assets/d31d4eed2912/mount_3.gz > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+cdee56dbcdf0096ef605@xxxxxxxxxxxxxxxxxxxxxxxxx > > > > general protection fault, probably for non-canonical address 0xdffffc000a8a4829: 0000 [#1] PREEMPT SMP KASAN > > KASAN: probably user-memory-access in range [0x0000000054524148-0x000000005452414f] > > CPU: 1 PID: 5065 Comm: syz-executor260 Not tainted 6.8.0-rc2-syzkaller-00031-g861c0981648f #0 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023 > > RIP: 0010:jbd2__journal_start+0x87/0x5d0 fs/jbd2/transaction.c:496 > > Code: 74 63 48 8b 1b 48 85 db 74 79 48 89 d8 48 c1 e8 03 42 80 3c 28 00 74 08 48 89 df e8 63 4d 8f ff 48 8b 2b 48 89 e8 48 c1 e8 03 <42> 80 3c 28 00 74 08 48 89 ef e8 4a 4d 8f ff 4c 39 65 00 0f 85 1a > > RSP: 0018:ffffc900043265c8 EFLAGS: 00010203 > > RAX: 000000000a8a4829 RBX: ffff8880205fa3a8 RCX: ffff8880235dbb80 > > RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88801c1a6000 > > RBP: 000000005452414e R08: 0000000000000c40 R09: 0000000000000001 > ^^^^^^^^ > Hmmmm - TRAN. That's looks suspicious, I'll come back to that. > > > R10: dffffc0000000000 R11: ffffed1003834871 R12: ffff88801c1a6000 > > R13: dffffc0000000000 R14: 0000000000000c40 R15: 0000000000000002 > > FS: 0000555556f90380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 0000000020020000 CR3: 0000000021fed000 CR4: 00000000003506f0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > Call Trace: > > <TASK> > > __ext4_journal_start_sb+0x215/0x5b0 fs/ext4/ext4_jbd2.c:112 > > __ext4_journal_start fs/ext4/ext4_jbd2.h:326 [inline] > > ext4_dirty_inode+0x92/0x110 fs/ext4/inode.c:5969 > > __mark_inode_dirty+0x305/0xda0 fs/fs-writeback.c:2452 > > generic_update_time fs/inode.c:1905 [inline] > > inode_update_time fs/inode.c:1918 [inline] > > __file_update_time fs/inode.c:2106 [inline] > > file_update_time+0x39b/0x3e0 fs/inode.c:2136 > > ext4_page_mkwrite+0x207/0xdf0 fs/ext4/inode.c:6090 > > do_page_mkwrite+0x197/0x470 mm/memory.c:2966 > > wp_page_shared mm/memory.c:3353 [inline] > > do_wp_page+0x20e3/0x4c80 mm/memory.c:3493 > > handle_pte_fault mm/memory.c:5160 [inline] > > __handle_mm_fault+0x26a3/0x72b0 mm/memory.c:5285 > > handle_mm_fault+0x27e/0x770 mm/memory.c:5450 > > do_user_addr_fault arch/x86/mm/fault.c:1415 [inline] > > handle_page_fault arch/x86/mm/fault.c:1507 [inline] > > exc_page_fault+0x2ad/0x870 arch/x86/mm/fault.c:1563 > > asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:570 > > EXt4 is triggering a BUG_ON: > > handle_t *handle = journal_current_handle(); > int err; > > if (!journal) > return ERR_PTR(-EROFS); > > if (handle) { > >>>>>>>>> J_ASSERT(handle->h_transaction->t_journal == journal); > handle->h_ref++; > return handle; > } > > via a journal assert failure. It appears that current->journal_info > isn't what it is supposed to be. It's finding something with TRAN in > it, I think. I'll come back to this. > > What syzbot is doing is creating a file on it's root filesystem and > write()ing 0x208e24b bytes (zeroes from an anonymous mmap() region, > I think) to it to initialise it's contents. > > Then it mmap()s the ext4 file for 0xb36000 bytes and copies the raw > test filesystem image in the source code into it. It then creates a > memfd that it decompresses the data in the mapped ext4 file into and > creates a loop device that points to that memfd. It then mounts the > loop device and we get an XFS filesystem which doesn't appear to > contain any corruptions in it at all. It then runs a bulkstat pass > on the the XFS filesystem. > > This is where it gets interesting. The user buffer that XFS > copies the inode data into points to a memory address inside the > range of the ext4 file that the filesystem image was copied to. > It does not overlap with the filesystem image. > > Hence when XFS goes to copy the inodes into the user buffer, it > triggers write page faults on the ext4 backing file. > > That's this part of the trace: > > > > RIP: 0010:rep_movs_alternative+0x4a/0x70 arch/x86/lib/copy_user_64.S:71 > > Code: 75 f1 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 06 48 89 07 48 83 c6 08 48 83 c7 08 83 e9 08 74 df 83 f9 08 73 e8 eb c9 <f3> a4 c3 48 89 c8 48 c1 e9 03 83 e0 07 f3 48 a5 89 c1 85 c9 75 b3 > > RSP: 0018:ffffc900043270f8 EFLAGS: 00050202 > > RAX: ffffffff848cda01 RBX: 0000000020020040 RCX: 0000000000000040 > > RDX: 0000000000000000 RSI: ffff8880131873b0 RDI: 0000000020020000 > > RBP: 1ffff92000864f26 R08: ffff8880131873ef R09: 1ffff11002630e7d > > R10: dffffc0000000000 R11: ffffed1002630e7e R12: 00000000000000c0 > > R13: dffffc0000000000 R14: 000000002001ff80 R15: ffff888013187330 > > copy_user_generic arch/x86/include/asm/uaccess_64.h:112 [inline] > > raw_copy_to_user arch/x86/include/asm/uaccess_64.h:133 [inline] > > _copy_to_user+0x86/0xa0 lib/usercopy.c:41 > > copy_to_user include/linux/uaccess.h:191 [inline] > > xfs_bulkstat_fmt+0x4f/0x120 fs/xfs/xfs_ioctl.c:744 > > xfs_bulkstat_one_int+0xd8b/0x12e0 fs/xfs/xfs_itable.c:161 > > xfs_bulkstat_iwalk+0x72/0xb0 fs/xfs/xfs_itable.c:239 > > xfs_iwalk_ag_recs+0x4c3/0x820 fs/xfs/xfs_iwalk.c:220 > > xfs_iwalk_run_callbacks+0x25b/0x490 fs/xfs/xfs_iwalk.c:376 > > xfs_iwalk_ag+0xad6/0xbd0 fs/xfs/xfs_iwalk.c:482 > > xfs_iwalk+0x360/0x6f0 fs/xfs/xfs_iwalk.c:584 > > xfs_bulkstat+0x4f8/0x6c0 fs/xfs/xfs_itable.c:308 > > xfs_ioc_bulkstat+0x3d0/0x450 fs/xfs/xfs_ioctl.c:867 > > xfs_file_ioctl+0x6a5/0x1980 fs/xfs/xfs_ioctl.c:1994 > > vfs_ioctl fs/ioctl.c:51 [inline] > > __do_sys_ioctl fs/ioctl.c:871 [inline] > > __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x63/0x6b > > What is interesting here is this is running in an empty XFS > transaction context so that the bulkstat operation garbage collects > all the buffers it accesses without us having to explicit cleanup - > they all get released when we cancel the transaction context at the > end of the process. > > But that means copy-out is running inside a transaction context, and > that means current->journal_info contains a pointer to the current > struct xfs_trans the bulkstat operation is using. > > Guess what we have as the first item in a struct xfs_trans? That's > right, it's a magic number, and that magic number is: > > #define XFS_TRANS_HEADER_MAGIC 0x5452414e /* TRAN */ > > It should be obvious what has happened now - > current->journal_info is not null, so ext4 thinks it owns the > structure attached there and panics when it finds that it isn't an > ext4 journal handle being held there. > > I don't think there are any clear rules as to how filesystems can > and can't use current->journal_info. In general, a task can't jump > from one filesystem to another inside a transaction context like > this, so there's never been a serious concern about nested > current->journal_info assignments like this in the past. > > XFS is doing nothing wrong - we're allowed to define transaction > contexts however we want and use current->journal_info in this way. > However, we have to acknowledge that ext4 has also done nothing > wrong by assuming current->journal_info should below to it if it is > not null. Indeed, XFS does the same thing. Getting late here, so this will be pretty terse-- Thinking narrowly about just xfs, I think this means that the bulkstat/inumbers implementations need to allocate a bounce buffer to format records into so that it can copy_to_user without any locks held. We have no idea if the destination page is a file page or anonymous memory or whatever. Or: Do we really need to set current->journal_info for empty transactions? > The question here is what to do about this? The obvious solution is > to have save/restore semantics in the filesystem code that > sets/clears current->journal_info, and then filesystems can also do > the necessary "recursion into same filesystem" checks they need to > ensure that they aren't nesting transactions in a way that can > deadlock. We don't know what locks might be held by whatever code set journal_info. I don't see how we could push it aside sanely? > Maybe there are other options - should filesystems even be allowed to > trigger page faults when they have set current->journal_info? I wonder if we ought to be checking current->journal_info in our own page_mkwrite handler and throwing back an errno? I don't think we want to go down the rabbithele of "someone else was running a transaction, maybe we can proceed with an update anyway???". > What other potential avenues are there that could cause this sort of > transaction context nesting that we haven't realised exist? Does > ext4 data=jounral have problems like this in the data copy-in path? > What other filesystems allow page faults in transaction contexts? Ugh, whackamole. :/ --D > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx >