On Thu, Apr 19, 2018 at 07:44:40PM -0700, Eric Biggers wrote: > On Mon, Apr 02, 2018 at 03:34:15PM +0100, Al Viro wrote: > > On Mon, Apr 02, 2018 at 07:40:22PM +0900, Tetsuo Handa wrote: > > > > > That commit assumes that calling kill_sb() from deactivate_locked_super(s) > > > without corresponding fill_super() is safe. We have so far crashed with > > > rpc_mount() and kernfs_mount_ns(). Is that really safe? > > > > Consider the case when fill_super() returns an error immediately. > > It is exactly the same situation. And ->kill_sb() *is* called in cases > > when fill_super() has failed. Always had been - it's much less boilerplate > > that way. > > > > deactivate_locked_super() on that failure exit is the least painful > > variant, unfortunately. > > > > Filesystems with ->kill_sb() instances that rely upon something > > done between sget() and the first failure exit after it need to be fixed. > > And yes, that should've been spotted back then. Sorry. > > > > Fortunately, we don't have many of those - kill_{block,litter,anon}_super() > > are safe and those are the majority. Looking through the rest uncovers > > some bugs; so far all I've seen were already there. Note that normally > > we have something like > > static void affs_kill_sb(struct super_block *sb) > > { > > struct affs_sb_info *sbi = AFFS_SB(sb); > > kill_block_super(sb); > > if (sbi) { > > affs_free_bitmap(sb); > > affs_brelse(sbi->s_root_bh); > > kfree(sbi->s_prefix); > > mutex_destroy(&sbi->s_bmlock); > > kfree(sbi); > > } > > } > > which basically does one of the safe ones augmented with something that > > takes care *not* to assume that e.g. ->s_fs_info has been allocated. > > Not everyone does, though: > > > > jffs2_fill_super(): > > c = kzalloc(sizeof(*c), GFP_KERNEL); > > if (!c) > > return -ENOMEM; > > in the very beginning. So we can return from it with NULL ->s_fs_info. > > Now, consider > > struct jffs2_sb_info *c = JFFS2_SB_INFO(sb); > > if (!(sb->s_flags & MS_RDONLY)) > > jffs2_stop_garbage_collect_thread(c); > > in jffs2_kill_sb() and > > void jffs2_stop_garbage_collect_thread(struct jffs2_sb_info *c) > > { > > int wait = 0; > > spin_lock(&c->erase_completion_lock); > > if (c->gc_task) { > > > > IOW, fail that kzalloc() (or, indeed, an allocation in register_shrinker()) > > and eat an oops. Always had been there, always hard to hit without > > fault injectors and fortunately trivial to fix. > > > > Similar in nfs_kill_super() calling nfs_free_server(). > > Similar in v9fs_kill_super() with v9fs_session_cancel()/v9fs_session_close() calls. > > Similar in hypfs_kill_super(), afs_kill_super(), btrfs_kill_super(), cifs_kill_sb() > > (all trivial to fix) > > > > Aha... nfsd_umount() is a new regression. > > > > orangefs: old, trivial to fix. > > > > cgroup_kill_sb(): old, hopefully easy to fix. Note that kernfs_root_from_sb() > > can bloody well return NULL, making cgroup_root_from_kf() oops. Always had been > > there. > > > > AFAICS, after discarding the instances that do the right thing we are left with: > > hypfs_kill_super, rdt_kill_sb, v9fs_kill_super, afs_kill_super, btrfs_kill_super, > > cifs_kill_sb, jffs2_kill_sb, nfs_kill_super, nfsd_umount, orangefs_kill_sb, > > proc_kill_sb, sysfs_kill_sb, cgroup_kill_sb, rpc_kill_sb. > > > > Out of those, nfsd_umount(), proc_kill_sb() and rpc_kill_sb() are regressions. > > So are rdt_kill_sb() and sysfs_kill_sb() (victims of the issue you've spotted > > in kernfs_kill_sb()). The rest are old (and I wonder if syzbot had been > > catching those - they are also dependent upon a specific allocation failing > > at the right time). > > > > Fix for the kernfs bug is now queued in vfs/for-linus: > > #syz fix: kernfs: deal with early sget() failures > But, there is still a related bug: when mounting sysfs, if register_shrinker() fails in sget_userns(), then kernfs_kill_sb() gets called, which frees the 'struct kernfs_super_info'. But, the 'struct kernfs_super_info' is also freed in kernfs_mount_ns() by: sb = sget_userns(fs_type, kernfs_test_super, kernfs_set_super, flags, &init_user_ns, info); if (IS_ERR(sb) || sb->s_fs_info != info) kfree(info); if (IS_ERR(sb)) return ERR_CAST(sb); I guess the problem is that sget_userns() shouldn't take ownership of the 'info' if it returns an error -- but, it actually does if register_shrinker() fails, resulting in a double free. Here is a reproducer and the KASAN splat. This is on Linus' tree (87ef12027b9b) with vfs/for-linus merged in. #define _GNU_SOURCE #include <fcntl.h> #include <sched.h> #include <stdio.h> #include <stdlib.h> #include <sys/mount.h> #include <sys/stat.h> #include <unistd.h> int main() { int fd, i; char buf[16]; unshare(CLONE_NEWNET); system("echo N > /sys/kernel/debug/failslab/ignore-gfp-wait"); system("echo 0 | tee /sys/kernel/debug/fail*/verbose"); fd = open("/proc/thread-self/fail-nth", O_WRONLY); for (i = 0; ; i++) { write(fd, buf, sprintf(buf, "%d", i)); mount("sysfs", "mnt", "sysfs", 0, NULL); umount("mnt"); } } ================================================================== BUG: KASAN: double-free or invalid-free in kernfs_mount_ns+0x5eb/0x7d0 fs/kernfs/mount.c:324 CPU: 12 PID: 268 Comm: syz_kernfs Not tainted 4.17.0-rc1-00038-gb25f027a33d42 #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x7c/0xbb lib/dump_stack.c:113 print_address_description+0x73/0x290 mm/kasan/report.c:256 kasan_report_invalid_free+0x60/0xa0 mm/kasan/report.c:336 __kasan_slab_free+0x1d5/0x220 mm/kasan/kasan.c:501 __cache_free mm/slab.c:3498 [inline] kfree+0x8a/0xd0 mm/slab.c:3813 kernfs_mount_ns+0x5eb/0x7d0 fs/kernfs/mount.c:324 sysfs_mount+0xa2/0x170 fs/sysfs/mount.c:36 mount_fs+0x38/0x1b0 fs/super.c:1268 vfs_kern_mount.part.10+0x53/0x3a0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:2514 [inline] do_new_mount fs/namespace.c:2517 [inline] do_mount+0x35c/0x2560 fs/namespace.c:2847 ksys_mount+0x7b/0xd0 fs/namespace.c:3063 __do_sys_mount fs/namespace.c:3077 [inline] __se_sys_mount fs/namespace.c:3074 [inline] __x64_sys_mount+0xb5/0x150 fs/namespace.c:3074 do_syscall_64+0x90/0x3f0 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7febbf5877ea RSP: 002b:00007ffee25976a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000560d99d26af4 RCX: 00007febbf5877ea RDX: 0000560d99d26aee RSI: 0000560d99d26af4 RDI: 0000560d99d26aee RBP: 0000560d99d26aee R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffee25976b0 R13: 0000560d99d26aeb R14: 0000000000000003 R15: 000000000000000c Allocated by task 268: kmem_cache_alloc_trace include/linux/slab.h:415 [inline] kmalloc include/linux/slab.h:512 [inline] kzalloc include/linux/slab.h:701 [inline] kernfs_mount_ns+0x71/0x7d0 fs/kernfs/mount.c:313 sysfs_mount+0xa2/0x170 fs/sysfs/mount.c:36 mount_fs+0x38/0x1b0 fs/super.c:1268 vfs_kern_mount.part.10+0x53/0x3a0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:2514 [inline] do_new_mount fs/namespace.c:2517 [inline] do_mount+0x35c/0x2560 fs/namespace.c:2847 ksys_mount+0x7b/0xd0 fs/namespace.c:3063 __do_sys_mount fs/namespace.c:3077 [inline] __se_sys_mount fs/namespace.c:3074 [inline] __x64_sys_mount+0xb5/0x150 fs/namespace.c:3074 do_syscall_64+0x90/0x3f0 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 268: __cache_free mm/slab.c:3498 [inline] kfree+0x8a/0xd0 mm/slab.c:3813 sysfs_kill_sb+0x15/0x30 fs/sysfs/mount.c:50 deactivate_locked_super+0x80/0xc0 fs/super.c:313 sget_userns+0x90c/0xb50 fs/super.c:523 kernfs_mount_ns+0x134/0x7d0 fs/kernfs/mount.c:321 sysfs_mount+0xa2/0x170 fs/sysfs/mount.c:36 mount_fs+0x38/0x1b0 fs/super.c:1268 vfs_kern_mount.part.10+0x53/0x3a0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:2514 [inline] do_new_mount fs/namespace.c:2517 [inline] do_mount+0x35c/0x2560 fs/namespace.c:2847 ksys_mount+0x7b/0xd0 fs/namespace.c:3063 __do_sys_mount fs/namespace.c:3077 [inline] __se_sys_mount fs/namespace.c:3074 [inline] __x64_sys_mount+0xb5/0x150 fs/namespace.c:3074 do_syscall_64+0x90/0x3f0 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff88006a6e6480 which belongs to the cache kmalloc-64 of size 64 The buggy address is located 0 bytes inside of 64-byte region [ffff88006a6e6480, ffff88006a6e64c0) The buggy address belongs to the page: page:ffffea0001748250 count:1 mapcount:0 mapping:ffff88006a6e6000 index:0xffff88006a6e6f80 flags: 0x100000000000100(slab) raw: 0100000000000100 ffff88006a6e6000 ffff88006a6e6f80 0000000100000010 raw: ffff88006d401338 ffff88006d401338 ffff88006d400200 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88006a6e6380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88006a6e6400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88006a6e6480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ^ ffff88006a6e6500: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff88006a6e6580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ==================================================================