On Wed 31-01-24 10:37:18, Dave Chinner wrote: > On Tue, Jan 30, 2024 at 06:52:21AM -0800, syzbot wrote: > > syzbot has found a reproducer for the following issue on: > > > > HEAD commit: 861c0981648f Merge tag 'jfs-6.8-rc3' of github.com:kleikam.. > > git tree: upstream > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=13ca8d97e80000 > > kernel config: https://syzkaller.appspot.com/x/.config?x=b0b9993d7d6d1990 > > dashboard link: https://syzkaller.appspot.com/bug?extid=cdee56dbcdf0096ef605 > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=104393efe80000 > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1393b90fe80000 > > > > Downloadable assets: > > disk image: https://storage.googleapis.com/syzbot-assets/7c6cc521298d/disk-861c0981.raw.xz > > vmlinux: https://storage.googleapis.com/syzbot-assets/6203c94955db/vmlinux-861c0981.xz > > kernel image: https://storage.googleapis.com/syzbot-assets/17e76e12b58c/bzImage-861c0981.xz > > mounted in repro: https://storage.googleapis.com/syzbot-assets/d31d4eed2912/mount_3.gz > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+cdee56dbcdf0096ef605@xxxxxxxxxxxxxxxxxxxxxxxxx > > > > general protection fault, probably for non-canonical address 0xdffffc000a8a4829: 0000 [#1] PREEMPT SMP KASAN > > KASAN: probably user-memory-access in range [0x0000000054524148-0x000000005452414f] > > CPU: 1 PID: 5065 Comm: syz-executor260 Not tainted 6.8.0-rc2-syzkaller-00031-g861c0981648f #0 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023 > > RIP: 0010:jbd2__journal_start+0x87/0x5d0 fs/jbd2/transaction.c:496 > > Code: 74 63 48 8b 1b 48 85 db 74 79 48 89 d8 48 c1 e8 03 42 80 3c 28 00 74 08 48 89 df e8 63 4d 8f ff 48 8b 2b 48 89 e8 48 c1 e8 03 <42> 80 3c 28 00 74 08 48 89 ef e8 4a 4d 8f ff 4c 39 65 00 0f 85 1a > > RSP: 0018:ffffc900043265c8 EFLAGS: 00010203 > > RAX: 000000000a8a4829 RBX: ffff8880205fa3a8 RCX: ffff8880235dbb80 > > RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88801c1a6000 > > RBP: 000000005452414e R08: 0000000000000c40 R09: 0000000000000001 > ^^^^^^^^ > Hmmmm - TRAN. That's looks suspicious, I'll come back to that. Indeed, thanks for the great analysis. <snip analysis> > The question here is what to do about this? The obvious solution is > to have save/restore semantics in the filesystem code that > sets/clears current->journal_info, and then filesystems can also do > the necessary "recursion into same filesystem" checks they need to > ensure that they aren't nesting transactions in a way that can > deadlock. As others have mentioned, this seems potentially dangerous because that just hides potential deadlocks. E.g. for ext4 taking a page fault while having a transaction started violates lock ordering requirements (mapping->invalidate_lock > transaction start). OTOH we have lockdep tracking for this anyway so I guess we don't care too much for ext4. > Maybe there are other options - should filesystems even be allowed to > trigger page faults when they have set current->journal_info? For ext4 it would definitely be a bug if this happens and it is not only about usage of current->journal_info as I wrote above. > What other potential avenues are there that could cause this sort of > transaction context nesting that we haven't realised exist? Does > ext4 data=jounral have problems like this in the data copy-in path? > What other filesystems allow page faults in transaction contexts? So I'm reasonably confident we aren't hitting any such path in ext4 as lockdep would tell us about it (we treat transaction start as lock acquisition in jbd2 and tell lockdep about it). For the write path, we are relying on VFS prefaulting pages before calling ->write_begin (where we start a transaction) and then doing atomic copy. For the read path we don't start any transaction (except for possible atime update but that's just a tiny transaction on the side after the read completes). So ext4 on its own should be fine. But we have also btrfs, ceph, gfs2, nilfs2, ocfs2, and reiserfs using current->journal_info so overall chances for interaction are ... non-trivial. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR