On Thu, Apr 01, 2021 at 07:46:13PM +0200, Christian Brauner wrote:
> On Thu, Apr 01, 2021 at 10:09:18AM -0600, Jens Axboe wrote:
> > On 4/1/21 9:45 AM, Christian Brauner wrote:
> > > On Thu, Apr 01, 2021 at 02:09:20AM -0700, syzbot wrote:
> > >> Hello,
> > >>
> > >> syzbot found the following issue on:
> > >>
> > >> HEAD commit:    d19cc4bf Merge tag 'trace-v5.12-rc5' of git://git.kernel.o..
> > >> git tree:       upstream
> > >> console output: https://syzkaller.appspot.com/x/log.txt?x=1018f281d00000
> > >> kernel config:  https://syzkaller.appspot.com/x/.config?x=d1a3d65a48dbd1bc
> > >> dashboard link: https://syzkaller.appspot.com/bug?extid=c88a7030da47945a3cc3
> > >> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12f50d11d00000
> > >> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=137694a1d00000
> > >>
> > >> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > >> Reported-by: syzbot+c88a7030da47945a3cc3@xxxxxxxxxxxxxxxxxxxxxxxxx
> > >>
> > >> ------------[ cut here ]------------
> > >> WARNING: CPU: 1 PID: 8409 at fs/namespace.c:1186 mntput_no_expire+0xaca/0xcb0 fs/namespace.c:1186
> > >> Modules linked in:
> > >> CPU: 1 PID: 8409 Comm: syz-executor035 Not tainted 5.12.0-rc5-syzkaller #0
> > >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > >> RIP: 0010:mntput_no_expire+0xaca/0xcb0 fs/namespace.c:1186
> > >> Code: ff 48 c7 c2 e0 cb 78 89 be c2 02 00 00 48 c7 c7 a0 cb 78 89 c6 05 e5 6d e5 0b 01 e8 ff e1 f6 06 e9 3f fd ff ff e8 c6 a5 a8 ff <0f> 0b e9 fc fc ff ff e8 ba a5 a8 ff e8 55 dc 94 ff 31 ff 89 c5 89
> > >> RSP: 0018:ffffc9000165fc78 EFLAGS: 00010293
> > >> RAX: 0000000000000000 RBX: 1ffff920002cbf95 RCX: 0000000000000000
> > >> RDX: ffff88802072d4c0 RSI: ffffffff81cb4b8a RDI: 0000000000000003
> > >> RBP: ffff888011656900 R08: 0000000000000000 R09: ffffffff8fa978af
> > >> R10: ffffffff81cb4884 R11: 0000000000000000 R12: 0000000000000008
> > >> R13: ffffc9000165fcc8 R14: dffffc0000000000 R15: 00000000ffffffff
> > >> FS:  0000000000000000(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
> > >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> CR2: 000055a722053160 CR3: 000000000bc8e000 CR4: 00000000001506e0
> > >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >> Call Trace:
> > >>  mntput fs/namespace.c:1232 [inline]
> > >>  cleanup_mnt+0x523/0x530 fs/namespace.c:1132
> > >>  task_work_run+0xdd/0x1a0 kernel/task_work.c:140
> > >>  exit_task_work include/linux/task_work.h:30 [inline]
> > >>  do_exit+0xbfc/0x2a60 kernel/exit.c:825
> > >>  do_group_exit+0x125/0x310 kernel/exit.c:922
> > >>  __do_sys_exit_group kernel/exit.c:933 [inline]
> > >>  __se_sys_exit_group kernel/exit.c:931 [inline]
> > >>  __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
> > >>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> > >>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > >> RIP: 0033:0x446af9
> > >> Code: Unable to access opcode bytes at RIP 0x446acf.
> > >> RSP: 002b:00000000005dfe48 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > >> RAX: ffffffffffffffda RBX: 00000000004ce450 RCX: 0000000000446af9
> > >> RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
> > >> RBP: 0000000000000001 R08: ffffffffffffffbc R09: 0000000000000000
> > >> R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004ce450
> > >> R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
> > >
> > > [+Cc Jens + io_uring]
> > >
> > > Hm, this reproducer uses io_uring and it's io_uring_enter() that
> > > triggers this reliably. With this reproducer I've managed to reproduce
> > > the issue on v5.12-rc4, v5.12-rc3, v5.12-rc2, and v5.12-rc1.
> > > It's not reproducible at
> > > 9820b4dca0f9c6b7ab8b4307286cdace171b724d
> > > which is the commit immediately before the first v5.12 io_uring merge.
> > > It's first reproducible with the first io_uring merge for v5.12, i.e.
> > > 5bbb336ba75d95611a7b9456355b48705016bdb1
> >
> > Thanks, that's good info. I'll take a look at it and see if I can
> > reproduce.
>
> Ok, I was deep into this anyway and it didn't make much sense to do
> anything else at that point, so I bisected this a bit further. The first
> bad commit is:
>
> commit 3a81fd02045c329f25e5900fa61f613c9b317644
> Author: Jens Axboe <axboe@xxxxxxxxx>
> Date:   Thu Dec 10 12:25:36 2020 -0700
>
>     io_uring: enable LOOKUP_CACHED path resolution for filename lookups
>
>     Instead of being pessimistic and assuming that path lookup will block,
>     use LOOKUP_CACHED to attempt just a cached lookup. This ensures that
>     the fast path is always done inline, and we only punt to async context
>     if IO is needed to satisfy the lookup.
>
>     For forced nonblock open attempts, mark the file O_NONBLOCK over the
>     actual ->open() call as well. We can safely clear this again before
>     doing fd_install(), so it'll never be user visible that we fiddled with
>     it.
>
>     This greatly improves the performance of file open where the dentry is
>     already cached:
>
>     Cached    5.10-git     5.10-git+LOOKUP_CACHED    Speedup
>     ---------------------------------------------------------------
>      33%      1,014,975    900,474                   1.1x
>      89%        545,466    292,937                   1.9x
>     100%        435,636    151,475                   2.9x
>
>     The more cache hot we are, the more the inline LOOKUP_CACHED
>     optimization helps. This is unsurprising and expected, as the thread
>     offload becomes a more dominant part of the total overhead. If we look
>     at io_uring tracing, doing an IORING_OP_OPENAT on a file that isn't in
>     the dentry cache will yield:
>
>     275.550481: io_uring_create: ring 00000000ddda6278, fd 3 sq size 8, cq size 16, flags 0
>     275.550491: io_uring_submit_sqe: ring 00000000ddda6278, op 18, data 0x0, non block 1, sq_thread 0
>     275.550498: io_uring_queue_async_work: ring 00000000ddda6278, request 00000000c0267d17, flags 69760, normal queue, work 000000003d683991
>     275.550502: io_uring_cqring_wait: ring 00000000ddda6278, min_events 1
>     275.550556: io_uring_complete: ring 00000000ddda6278, user_data 0x0, result 4
>
>     which shows a failed nonblock lookup, then a punt to the worker, and
>     then we complete with fd == 4. This takes 65 usec in total. Re-running
>     the same test case again:
>
>     281.253956: io_uring_create: ring 0000000008207252, fd 3 sq size 8, cq size 16, flags 0
>     281.253967: io_uring_submit_sqe: ring 0000000008207252, op 18, data 0x0, non block 1, sq_thread 0
>     281.253973: io_uring_complete: ring 0000000008207252, user_data 0x0, result 4
>
>     shows the same request completing inline, also returning fd == 4. This
>     takes 6 usec.
>     Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>

I _think_ I see what the issue is. It seems that an assumption made in
this commit might be wrong, and we're missing a mnt_add_count() bump
that we would otherwise have gotten if we hadn't moved the failure
handling out of the unlazy helpers themselves. Al, does that sound
plausible?

commit eacd9aa8cedeb412842c7b339adbaa0477fdd5ad
Author: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Date:   Mon Feb 15 12:03:23 2021 -0500

    fix handling of nd->depth on LOOKUP_CACHED failures in try_to_unlazy*

    After switching to non-RCU mode, we want nd->depth to match the number
    of entries in nd->stack[] that need eventual path_put().
    legitimize_links() takes care of that on failures; unfortunately, the
    failure exits added for LOOKUP_CACHED do not.

    We could add the logic for that into those failure exits, both in
    try_to_unlazy() and in try_to_unlazy_next(), but since both checks are
    immediately followed by legitimize_links() and there are no calls of
    legitimize_links() other than those two... It's easier to move the
    check (and the required handling of nd->depth on failure) into
    legitimize_links() itself.

    [caught by Jens: ... and since we are zeroing ->depth here, we need to
    do drop_links() first]

    Fixes: 6c6ec2b0a3e0 ("fs: add support for LOOKUP_CACHED")
    Tested-by: Jens Axboe <axboe@xxxxxxxxx>
    Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>

diff --git a/fs/namei.c b/fs/namei.c
index 4cae88733a5c..de74ad2bc6e2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -630,6 +630,11 @@ static inline bool legitimize_path(struct nameidata *nd,
 static bool legitimize_links(struct nameidata *nd)
 {
        int i;
+       if (unlikely(nd->flags & LOOKUP_CACHED)) {
+               drop_links(nd);
+               nd->depth = 0;
+               return false;
+       }
        for (i = 0; i < nd->depth; i++) {
                struct saved *last = nd->stack + i;
                if (unlikely(!legitimize_path(nd, &last->link, last->seq))) {
@@ -686,8 +691,6 @@ static bool try_to_unlazy(struct nameidata *nd)
        BUG_ON(!(nd->flags & LOOKUP_RCU));
 
        nd->flags &= ~LOOKUP_RCU;
-       if (nd->flags & LOOKUP_CACHED)
-               goto out1;
        if (unlikely(!legitimize_links(nd)))
                goto out1;
        if (unlikely(!legitimize_path(nd, &nd->path, nd->seq)))
@@ -724,8 +727,6 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi
        BUG_ON(!(nd->flags & LOOKUP_RCU));
 
        nd->flags &= ~LOOKUP_RCU;
-       if (nd->flags & LOOKUP_CACHED)
-               goto out2;
        if (unlikely(!legitimize_links(nd)))
                goto out2;
        if (unlikely(!legitimize_mnt(nd->path.mnt, nd->m_seq)))
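
To spell the suspicion out, here is a rough sketch of the ordering I
have in mind, modeled on the try_to_unlazy_next() flow touched by the
last hunk above. This is paraphrased pseudo-C, not verbatim fs/namei.c;
the helper names are real, but the simplified body and the "fail" label
are mine:

	/*
	 * Sketch only: the rcu-walk -> ref-walk transition. On success,
	 * legitimize_mnt() (via __legitimize_mnt(), iirc) is what takes
	 * the reference on nd->path.mnt -- the mnt_add_count(mnt, 1)
	 * bump that a later path_put()/mntput() is supposed to pair
	 * with.
	 */
	static bool try_to_unlazy_next_sketch(struct nameidata *nd)
	{
		nd->flags &= ~LOOKUP_RCU;

		/*
		 * Since eacd9aa8cede, a LOOKUP_CACHED walk now fails
		 * inside legitimize_links() itself, i.e. before we ever
		 * get to legitimize_mnt() below, so no reference on
		 * nd->path.mnt is taken on this path.
		 */
		if (unlikely(!legitimize_links(nd)))
			goto fail;

		/* the mnt_add_count(mnt, 1) bump happens in here */
		if (unlikely(!legitimize_mnt(nd->path.mnt, nd->m_seq)))
			goto fail;

		/* ... legitimize nd->path.dentry etc. ... */
		return true;

	fail:
		/*
		 * If nd->path is not invalidated on this exit, cleanup
		 * will eventually path_put() -> mntput() a reference
		 * that was never taken; mnt_add_count(mnt, -1) then goes
		 * negative and the refcount sanity WARN in
		 * mntput_no_expire() fires, which would match syzbot's
		 * trace above.
		 */
		rcu_read_unlock();
		return false;
	}

If that theory holds, the LOOKUP_CACHED failure exit presumably needs
to invalidate nd->path the way the old failure exits in the helpers
did, but I may well be misreading the refcounting here.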