On Mon, Apr 05, 2021 at 04:18:58PM +0000, Al Viro wrote:
> On Mon, Apr 05, 2021 at 01:44:37PM +0200, Christian Brauner wrote:
> > On Sun, Apr 04, 2021 at 08:17:21PM +0000, Al Viro wrote:
> > > On Sun, Apr 04, 2021 at 06:50:10PM +0000, Al Viro wrote:
> > >
> > > > > Yeah, I have at least namei.o
> > > > >
> > > > > https://drive.google.com/file/d/1AvO1St0YltIrA86DXjp1Xg3ojtS9owGh/view?usp=sharing
> > > >
> > > > *grumble*
> > > >
> > > > Is it reproducible without KASAN?  Would be much easier to follow the produced
> > > > asm...
> > >
> > > Looks like inode_permission(_, NULL, _) from may_lookup(nd).  I.e.
> > > nd->inode == NULL.
> >
> > Yeah, I already saw that.
> >
> > >
> > > Mind slapping BUG_ON(!nd->inode) right before may_lookup() call in
> > > link_path_walk() and trying to reproduce that oops?
> >
> > Yep, no problem. If you run the reproducer in a loop for a little while
> > you eventually trigger the BUG_ON() and then you get the following splat
> > (and then an endless loop) in [1] with nd->inode NULL.
> >
> > _But_ I managed to debug this further and was able to trigger the BUG_ON()
> > directly in path_init() in the AT_FDCWD branch (after all, it's AT_FDCWD(./file0))
> > with the patch in [3] (it's in LOOKUP_RCU); the corresponding splat is in [2].
> > So the crash happens for a PF_IO_WORKER thread with a NULL nd->inode for the
> > PF_IO_WORKER's pwd (the PF_IO_WORKER seems to be in async context).
>
> So we find current->fs->pwd.dentry negative, with current->fs->seq sampled
> equal before and after that?  Lovely...  The only places where we assign
> anything to ->pwd.dentry are
> void set_fs_pwd(struct fs_struct *fs, const struct path *path)
> {
> 	struct path old_pwd;
>
> 	path_get(path);
> 	spin_lock(&fs->lock);
> 	write_seqcount_begin(&fs->seq);
> 	old_pwd = fs->pwd;
> 	fs->pwd = *path;
> 	write_seqcount_end(&fs->seq);
> 	spin_unlock(&fs->lock);
>
> 	if (old_pwd.dentry)
> 		path_put(&old_pwd);
> }
> where we have ->seq bumped between dget new/assignment/ dput old,
> copy_fs_struct() where we have
> 		spin_lock(&old->lock);
> 		fs->root = old->root;
> 		path_get(&fs->root);
> 		fs->pwd = old->pwd;
> 		path_get(&fs->pwd);
> 		spin_unlock(&old->lock);
> fs being freshly allocated instance that couldn't have been observed
> by anyone and chroot_fs_refs(), where we have
> 			spin_lock(&fs->lock);
> 			write_seqcount_begin(&fs->seq);
> 			hits += replace_path(&fs->root, old_root, new_root);
> 			hits += replace_path(&fs->pwd, old_root, new_root);
> 			write_seqcount_end(&fs->seq);
> 			while (hits--) {
> 				count++;
> 				path_get(new_root);
> 			}
> 			spin_unlock(&fs->lock);
> ...
> static inline int replace_path(struct path *p, const struct path *old, const struct path *new)
> {
> 	if (likely(p->dentry != old->dentry || p->mnt != old->mnt))
> 		return 0;
> 	*p = *new;
> 	return 1;
> }
> Here we have new_root->dentry pinned from the very beginning,
> and assignments are wrapped into bumps of ->seq.  Moreover,
> we are holding ->lock through that sequence (as all writers
> do), so these references can't be dropped before path_get()
> bumps new_root->dentry refcount.
>
> chroot_fs_refs() is called only by pivot_root(2):
> 	chroot_fs_refs(&root, &new);
> and there new is set by
> 	error = user_path_at(AT_FDCWD, new_root,
> 			     LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new);
> 	if (error)
> 		goto out0;
> which pins new.dentry *and* verifies that it's positive and a directory,
> at that.  Since pinned positive dentry can't be made negative by anybody
> else, we know it will remain in that state until
> 	path_put(&new);
> well downstream of chroot_fs_refs().  In copy_fs_struct() we are
> copying someone's ->pwd, so it's also pinned positive.  And it
> won't be dropped outside of old->lock, so by the time somebody
> manages to drop the reference in old, path_get() effects will be
> visible (old->lock serving as a barrier).
>
> That leaves set_fs_pwd() calls:
> fs/init.c:54:   set_fs_pwd(current->fs, &path);
> 	init_chdir(), path set by LOOKUP_DIRECTORY pathwalk.  Pinned positive.
> fs/namespace.c:4207:    set_fs_pwd(current->fs, &root);
> 	init_mount_tree(), root.dentry being ->mnt_root of rootfs.  Pinned
> 	positive (and it would've oopsed much earlier had that been it)
> fs/namespace.c:4485:    set_fs_pwd(fs, &root);
> 	mntns_install(), root filled by successful LOOKUP_DOWN for "/"
> 	from mnt_ns->root.  Should be pinned positive.
> fs/open.c:501:  set_fs_pwd(current->fs, &path);
> 	chdir(2), path set by LOOKUP_DIRECTORY pathwalk.  Pinned positive.
> fs/open.c:528:  set_fs_pwd(current->fs, &f.file->f_path);
> 	fchdir(2), file->f_path of any opened file.  Pinned positive.
> kernel/usermode_driver.c:130:   set_fs_pwd(current->fs, &umd_info->wd);
> 	umd_setup(), ->wd.dentry equal to ->wd.mnt->mnt_root, should be pinned positive.
> kernel/nsproxy.c:509:   set_fs_pwd(me->fs, &nsset->fs->pwd);
> 	commit_nsset().  Let's see what's going on there...
>
> 	if ((flags & CLONE_NEWNS) && (flags & ~CLONE_NEWNS)) {
> 		set_fs_root(me->fs, &nsset->fs->root);
> 		set_fs_pwd(me->fs, &nsset->fs->pwd);
> 	}
> In those conditions nsset.fs has come from copy_fs_struct() done in
> prepare_nsset().  And the only thing that might've been done to it
> would be those set_fs_pwd() in mntns_install() (I'm not fond of the
> entire nsset->fs thing - looks like papering over bad calling
> conventions, but anyway)
>
> Now, I might've missed some insanity (direct assignments to ->pwd.dentry,
> etc. - wouldn't be the first time io_uring folks went "layering? wassat?
> we'll just poke in whatever we can reach"), but I don't see anything
> obvious of that sort in the area...
>
> OK, how about this: in path_init(), right after
> 		do {
> 			seq = read_seqcount_begin(&fs->seq);
> 			nd->path = fs->pwd;
> 			nd->inode = nd->path.dentry->d_inode;
> 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
> 		} while (read_seqcount_retry(&fs->seq, seq));
> slap
> 		if (!nd->inode) {
> 			// should never, ever happen
> 			struct dentry *fucked = nd->path.dentry;
> 			printk(KERN_ERR "%pd4 %d %x %p %d %d", fucked, d_count(fucked),
> 				fucked->d_flags, fs, fs->users, seq);
> 			BUG_ON(1);
> 			return ERR_PTR(-EINVAL);
> 		}
> and see what it catches?

Ah dentry count of -127 looks... odd.

[ 246.102077] /newroot/foo -127 18008 ffff888012819000 6 0
[ 246.102240] ------------[ cut here ]------------
[ 246.102264] /newroot/foo -127 18008 ffff888012819000 6 6
[ 246.104163] ------------[ cut here ]------------
[ 246.104943] kernel BUG at fs/namei.c:2359!
[ 246.106342] kernel BUG at fs/namei.c:2359!
[ 246.106385] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 246.110540] CPU: 0 PID: 6345 Comm: uring_viro Tainted: G W E 5.12.0-rc5-1ebc00aa82b08217d1fc4eef5435f8499783194c #53
[ 246.113725] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)/LXD, BIOS 0.0.0 02/06/2015
[ 246.116115] RIP: 0010:path_init.cold+0xbb/0xea
[ 246.117711] Code: d0 7c 04 84 d2 75 4b 4c 8b 45 98 41 89 d9 44 89 f9 4c 89 e6 41 8b 94 24 d0 00 00 00 48 c7 c7 20 93 1a b1 41 55 e8 1c 70 fe ff <0f> 0b 48 8b 7d b8 e8 1f ee e6 f8 e9 55 ff ff ff 48 8b 7d 98 e8 01
[ 246.124372] RSP: 0018:ffffc900073275f0 EFLAGS: 00010282
[ 246.126466] RAX: 000000000000002c RBX: 0000000000000006 RCX: 0000000000000000
[ 246.129685] RDX: 0000000000000000 RSI: ffff88801354d700 RDI: fffff52000e64eb0
[ 246.131400] RBP: ffffc900073276a0 R08: 000000000000002c R09: ffffed1002b46045
[ 246.133241] R10: ffff888015a30227 R11: ffffed1002b46044 R12: ffff8880303eb028
[ 246.135124] R13: 0000000000000006 R14: ffffc90007327820 R15: 0000000000018008
[ 246.136931] FS: 00007f8695724800(0000) GS:ffff888015a00000(0000) knlGS:0000000000000000
[ 246.139247] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 246.141604] CR2: 000055fdf3c11008 CR3: 000000002e9dd000 CR4: 0000000000350ef0
[ 246.143437] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 246.145337] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 246.147114] Call Trace:
[ 246.147910] ? write_comp_data+0x2a/0x90
[ 246.149010] path_openat+0x192/0x2790
[ 246.150062] ? path_lookupat.isra.0+0x530/0x530
[ 246.151295] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 246.152561] ? lockdep_hardirqs_on_prepare+0x400/0x400
[ 246.154071] do_filp_open+0x197/0x270
[ 246.155157] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 246.156392] ? may_open_dev+0xf0/0xf0
[ 246.157969] ? do_raw_spin_lock+0x125/0x2e0
[ 246.159167] ? write_comp_data+0x2a/0x90
[ 246.160319] ? __sanitizer_cov_trace_pc+0x1d/0x50
[ 246.161649] ? _raw_spin_unlock+0x29/0x40
[ 246.162823] ? alloc_fd+0x499/0x640
[ 246.164092] io_openat2+0x1d1/0x8f0
[ 246.165403] ? io_req_complete_post+0xa90/0xa90
[ 246.166974] ? __lock_acquire+0x1847/0x5850
[ 246.168455] ? write_comp_data+0x2a/0x90
[ 246.169877] io_issue_sqe+0x2a2/0x5ac0
[ 246.171226] ? lockdep_hardirqs_on_prepare+0x400/0x400
[ 246.173079] ? io_poll_complete.constprop.0+0x100/0x100
[ 246.174960] ? rcu_read_lock_sched_held+0xa1/0xd0
[ 246.176468] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 246.187303] ? find_held_lock+0x2d/0x110
[ 246.197951] ? __might_fault+0xd8/0x180
[ 246.208458] __io_queue_sqe+0x19f/0xcf0
[ 246.218694] ? __check_object_size+0x1b4/0x4e0
[ 246.228802] ? __ia32_sys_io_uring_setup+0x70/0x70
[ 246.239099] ? write_comp_data+0x2a/0x90
[ 246.249152] io_queue_sqe+0x612/0xb70
[ 246.258967] io_submit_sqes+0x517d/0x6650
[ 246.268445] ? __x64_sys_io_uring_enter+0xb15/0xdd0
[ 246.282682] __x64_sys_io_uring_enter+0xb15/0xdd0
[ 246.292110] ? __ia32_sys_io_uring_enter+0xdd0/0xdd0
[ 246.301285] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 246.310116] ? syscall_enter_from_user_mode+0x27/0x70
[ 246.318845] do_syscall_64+0x2d/0x70
[ 246.327328] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 246.336188] RIP: 0033:0x7f869583a67d
[ 246.344145] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bb f7 0c 00 f7 d8 64 89 01 48
[ 246.367649] RSP: 002b:000055fdf2a89e98 EFLAGS: 00000212 ORIG_RAX: 00000000000001aa
[ 246.377100] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f869583a67d
[ 246.386119] RDX: 0000000000000000 RSI: 00000000000045f5 RDI: 0000000000000003
[ 246.395095] RBP: 000055fdf2a89f70 R08: 0000000000000000 R09: 0000000000000000
[ 246.403925] R10: 0000000000000000 R11: 0000000000000212 R12: 000055fdf295c640
[ 246.412977] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 246.425046] Modules linked in: efi_pstore(E) efivarfs(E)
[ 246.434681] invalid opcode: 0000 [#2] PREEMPT SMP KASAN
[ 246.435099] ---[ end trace b331351bc5a092fa ]---
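One possible reading of the -127, offered as a guess rather than a diagnosis: if d_count() is reporting the dentry's lockref count, and a killed dentry has that count set to -128 (which, as far as I know, is what lockref_mark_dead() stores), then -127 would be a dead dentry that took one unconditional increment afterwards. A tiny sketch of that arithmetic follows; DEAD_MARK_SKETCH, count_reading and classify_count_sketch() are hypothetical names for illustration only, and the -128 constant is the assumption stated above.

```c
/* Assumption: the value a dead lockref count is set to. */
#define DEAD_MARK_SKETCH (-128)

/* Hypothetical decoded view of a raw reference count. */
struct count_reading {
	int dead;	/* nonzero if the count is below zero */
	int extra_refs;	/* increments applied on top of the dead marker */
};

/* Decode a raw count under the assumption above. */
static struct count_reading classify_count_sketch(int count)
{
	struct count_reading r = { 0, 0 };

	if (count < 0) {
		r.dead = 1;
		r.extra_refs = count - DEAD_MARK_SKETCH;
	}
	return r;
}
```

With the value from the splat, classify_count_sketch(-127) reports a dead count with one extra reference, i.e. something grabbed a reference to /newroot/foo after it had already been killed. Whether that is what actually happened here is not established by the log alone.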