On Wed, Feb 24, 2016 at 11:15 AM, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote: > On Wed, Feb 24, 2016 at 11:03 AM, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote: >> On Wed, Feb 24, 2016 at 5:46 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: >>> On Wed, Feb 24, 2016 at 11:12:13AM +0800, Ian Kent wrote: >>>> On Wed, 2016-02-17 at 00:40 +0100, Mickaël Salaün wrote: >>>> > Hi, >>>> > >>>> > Actually I found the same bug (without fuzzing) and I can reproduce it >>>> > in a deterministic way (e.g. by creating a LSM that return 1 for the >>>> > security_file_open hook). At least, from v4.2.8 I can easily trigger >>>> > traces like this : >>>> >>>> Reading through this thread I wonder if this is a new problem. >>>> >>>> Is there a previous kernel it can't be reproduced on? >>>> Perhaps a bisect will shed some light on what's happening. >>> >>> There are several things in the mix. What Mickaël has found is that a bunch >>> of places where _positive_ number returned instead of expected 0 or -E... can >>> get propagated all way back to caller of do_last(), confusing the hell out >>> of it. That's not what Dmitry has triggered, though. WARN_ON() in the >>> end of do_last() would've triggered, and IMO this one, along with mitigation >>> (map that "error value" to -EINVAL) should go into mainline and all -stable. >>> Bogus ->open() returning a positive number had always been bad news; in the >>> best case it would be returned to userland, leading to "open(2) has failed >>> and returned a positive number". Hell knows if we ever had such instances >>> (or have them right now), but I wouldn't bet on their absense. Rare >>> failure exits returning bogus values in an ->open() instance in some driver >>> can easily stay unnoticed for a long time. Starting from at least 3.6 >>> (circa the atomic_open support) it got more unpleasant than simple "confuse >>> the hell out of userland". ->open() isn't the only vector for injection of >>> such crap - ->permission() would also serve, same for several LSM turds, etc. >>> >>> Again, that's a separate problem. What Dmitry seems to be catching is getting >>> crap values fed to should_follow_link() as inode. I see one bug in that >>> area that does need fixing (fix present in the last patch I've posted, with >>> WARN_ON() to indicate that this thing has triggered; _that_ WARN_ON() should >>> be gone from the final variant, since this can trigger without driver bugs, >>> etc.) In RCU mode after we'd checked that dentry is a symlink one, we need >>> to verify that it hadn't been changed since before we'd fetched the inode. >>> It might have been e.g. a regular file, which got unlinked with symlink >>> created in its place. Then we'd go into get_link() with non-symlink inode >>> and oops on attempt to call its ->i_op->get_link(). That race is real, very >>> hard to hit (you need both the unlink(2) and symlink(2) to happen between >>> lookup_fast() and should_follow_link() and unless you have PREEMPT_RCU you >>> can't lose the timeslice there) and would've manifested differently. >>> >>> But that leaves other two kinds of bugs getting triggered: inode of some >>> non-symlink is possible, but what we saw included NULL inode when we'd >>> reached finish_open: in do_last(). Should be flat-out impossible - we either >>> get lookup_fast(..., &inode, ...) return 0 and store NULL in inode, or >>> get NULL inode from pinned d_is_symlink() dentry after having grabbed >>> a reference and left RCU mode. Neither should be possible without either >>> something weird happening to lookup_fast() (and we would've seen oopsen in >>> link_path_walk() if that could happen, BTW) or screwed dentry refcounting >>> somewhere, combined with a race that managed to turn... >>> >>> Oh, shit. No screwed refcounting is needed. Look: >>> BUG_ON(nd->flags & LOOKUP_RCU); >>> inode = d_backing_inode(path.dentry); >>> seq = 0; /* out of RCU mode, so the value doesn't matter */ >>> if (unlikely(d_is_negative(path.dentry))) { >>> path_to_nameidata(&path, nd); >>> return -ENOENT; >>> } >>> Suppose we come here with negative path.dentry. We are holding a reference, >>> all right, and for a _postive_ dentry that would've been enough to keep >>> it positive. Not so for a negative one, though - symlink(2) on another >>> CPU doint d_instantiate() just before the d_is_negative() check and we >>> are fucked - inode is stale and we sail through all the checks, all the >>> way into should_follow_link(). >>> >>> We also have the same kind of crap in walk_component() - >>> err = lookup_slow(nd, &path); >>> if (err < 0) >>> return err; >>> inode = d_backing_inode(path.dentry); >>> seq = 0; /* we are already out of RCU mode */ >>> err = -ENOENT; >>> if (d_is_negative(path.dentry)) >>> goto out_path_put; >>> There it's much harder to hit, though - we need it not just d_instantiate() >>> overlapping those lines; we need the racing syscall to get from locking >>> the parent to d_instantiate() between the point where lookup_slow() has >>> unlocked the parent and d_is_negative(). And lookup_slow() couldn't have >>> gone into mountpoint crossing, so it's really hard to hit - you pretty >>> much have to get preempted just after fetching inode. >>> >>> OK, the next delta to try, and there definitely are several commits in >>> that pile. It still does not explain the traces with GPF at 0x50, though - >>> for that we need not just a NULL getting to should_follow_link() but >>> something non-NULL with NULL at offset 40 from it (offset of ->i_sb in >>> struct inode). That something *can't* be a valid struct inode or had been >>> one in recent past - ->i_sb is assigned in new_inode(), value is always >>> non-NULL and never modified all the way until RCU-delayed freeing of struct >>> inode. So that has to be something entirely different... Anyway, the >>> patch so far follows: Fired after some time: [ 3491.147607] ------------[ cut here ]------------ [ 3491.147986] WARNING: CPU: 1 PID: 17000 at fs/namei.c:1728 path_openat+0x14e5/0x1560() [ 3491.148576] Modules linked in: [ 3491.148818] CPU: 1 PID: 17000 Comm: syz-executor Tainted: G W 4.5.0-rc5+ #72 [ 3491.149530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 3491.149530] 0000000000000000 ffff88002d6ebcb8 ffffffff8194ad79 0000000000000000 [ 3491.152103] ffffffff8334501c ffff88002d6ebcf0 ffffffff81172291 0000000000000000 [ 3491.152103] ffff88002d6ebd98 ffff88002d6ebde0 0000000000048000 ffff88002d6ebefc [ 3491.155310] Call Trace: [ 3491.155310] [<ffffffff8194ad79>] dump_stack+0x99/0xd0 [ 3491.155310] [<ffffffff81172291>] warn_slowpath_common+0x81/0xc0 [ 3491.157483] [<ffffffff81172385>] warn_slowpath_null+0x15/0x20 [ 3491.157483] [<ffffffff81318495>] path_openat+0x14e5/0x1560 [ 3491.157483] [<ffffffff81319549>] do_filp_open+0x79/0xd0 [ 3491.157483] [<ffffffff82b11d92>] ? _raw_spin_unlock+0x22/0x30 [ 3491.157483] [<ffffffff81328f08>] ? __alloc_fd+0xf8/0x200 [ 3491.157483] [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0 [ 3491.157483] [<ffffffff81306fbf>] SyS_openat+0xf/0x20 [ 3491.157483] [<ffffffff82b12836>] entry_SYSCALL_64_fastpath+0x16/0x7a [ 3491.161521] ---[ end trace c65af9be7536efae ]--- [ 3491.161838] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c [ 3491.162339] IP: [<ffffffff81327429>] atime_needs_update+0x9/0xc0 [ 3491.162577] PGD 2d626067 PUD 2c4cc067 PMD 0 [ 3491.163036] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [ 3491.163039] Modules linked in: [ 3491.163039] CPU: 1 PID: 17000 Comm: syz-executor Tainted: G W 4.5.0-rc5+ #72 [ 3491.163039] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 3491.164386] task: ffff88002c9196c0 ti: ffff88002d6e8000 task.ti: ffff88002d6e8000 [ 3491.164386] RIP: 0010:[<ffffffff81327429>] [<ffffffff81327429>] atime_needs_update+0x9/0xc0 [ 3491.164386] RSP: 0018:ffff88002d6ebcb0 EFLAGS: 00010282 [ 3491.165591] RAX: 0000000000000030 RBX: ffff88002d6ebde0 RCX: 0000000000000002 [ 3491.165591] RDX: ffff88002d6ebe38 RSI: 0000000000000000 RDI: ffff88002d6ebe38 [ 3491.165591] RBP: ffff88002d6ebcc0 R08: ffff88002d6ebe38 R09: 0000000000000001 [ 3491.167202] R10: 0000000000000001 R11: 0000000000001816 R12: 0000000000000000 [ 3491.167368] R13: ffff88002d6ebe38 R14: ffff880072508980 R15: ffff88002d6ebefc [ 3491.167368] FS: 00007f02454e5700(0000) GS:ffff88003ed00000(0000) knlGS:0000000000000000 [ 3491.167368] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 3491.167368] CR2: 000000000000000c CR3: 000000002cb5f000 CR4: 00000000000006e0 [ 3491.167368] Stack: [ 3491.169740] 0000000000000000 ffff88002d6ebde0 ffff88002d6ebd00 ffffffff81314922 [ 3491.169740] 0000000000000001 0000000000000000 ffff88002d6ebd98 ffff88002d6ebde0 [ 3491.169740] 0000000000048000 ffff88002d6ebefc ffff88002d6ebdd0 ffffffff8131729d [ 3491.169740] Call Trace: [ 3491.169740] [<ffffffff81314922>] trailing_symlink+0x62/0x260 [ 3491.169740] [<ffffffff8131729d>] path_openat+0x2ed/0x1560 [ 3491.169740] [<ffffffff81319549>] do_filp_open+0x79/0xd0 [ 3491.169740] [<ffffffff82b11d92>] ? _raw_spin_unlock+0x22/0x30 [ 3491.169740] [<ffffffff81328f08>] ? __alloc_fd+0xf8/0x200 [ 3491.169740] [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0 [ 3491.169740] [<ffffffff81306fbf>] SyS_openat+0xf/0x20 [ 3491.169740] [<ffffffff82b12836>] entry_SYSCALL_64_fastpath+0x16/0x7a [ 3491.169740] Code: ff ff ff 48 85 c0 48 89 c3 74 08 48 89 c7 e8 ef dc ff ff 48 89 d8 5b 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 83 ec 08 <f6> 46 0c 02 48 8b 1f 75 6b 48 8b 7e 28 48 8b 47 50 a9 01 04 00 [ 3491.169740] RIP [<ffffffff81327429>] atime_needs_update+0x9/0xc0 [ 3491.169740] RSP <ffff88002d6ebcb0> [ 3491.169740] CR2: 000000000000000c [ 3491.195051] ---[ end trace c65af9be7536efaf ]--- [ 3491.195051] BUG: sleeping function called from invalid context at include/linux/sched.h:2795 [ 3491.195051] in_atomic(): 1, irqs_disabled(): 1, pid: 17000, name: syz-executor [ 3491.195051] INFO: lockdep is turned off. [ 3491.195051] irq event stamp: 622 [ 3491.195051] hardirqs last enabled at (621): [<ffffffff811d9926>] vprintk_emit+0x2d6/0x5f0 [ 3491.195051] hardirqs last disabled at (622): [<ffffffff82b14d49>] error_entry+0x69/0xc0 [ 3491.195051] softirqs last enabled at (618): [<ffffffff81178172>] __do_softirq+0x222/0x4a0 [ 3491.195051] softirqs last disabled at (595): [<ffffffff81178767>] irq_exit+0xa7/0xc0 [ 3491.195051] CPU: 1 PID: 17000 Comm: syz-executor Tainted: G D W 4.5.0-rc5+ #72 [ 3491.195051] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 3491.195051] 0000000000000000 ffff88002d6eb9a8 ffffffff8194ad79 ffff88002c9196c0 [ 3491.195051] 0000000000004268 ffff88002d6eb9d0 ffffffff811a0659 ffffffff8329b7fb [ 3491.195051] 0000000000000aeb 0000000000000000 ffff88002d6eb9f8 ffffffff811a0764 [ 3491.195051] Call Trace: [ 3491.195051] [<ffffffff8194ad79>] dump_stack+0x99/0xd0 [ 3491.195051] [<ffffffff811a0659>] ___might_sleep+0x179/0x240 [ 3491.195051] [<ffffffff811a0764>] __might_sleep+0x44/0x80 [ 3491.195051] [<ffffffff811842cf>] exit_signals+0x1f/0x130 [ 3491.195051] [<ffffffff811756ef>] do_exit+0xbf/0xd10 [ 3491.195051] [<ffffffff811da524>] ? kmsg_dump+0x104/0x180 [ 3491.195051] [<ffffffff8108549f>] oops_end+0x9f/0xe0 [ 3491.195051] [<ffffffff810ce6e8>] no_context+0x108/0x390 [ 3491.195051] [<ffffffff810cea8d>] __bad_area_nosemaphore+0x11d/0x220 [ 3491.195051] [<ffffffff810ceb9e>] bad_area_nosemaphore+0xe/0x10 [ 3491.195051] [<ffffffff810cf2c4>] __do_page_fault+0x84/0x470 [ 3491.195051] [<ffffffff810cf764>] trace_do_page_fault+0x74/0x2c0 [ 3491.195051] [<ffffffff810c9ba4>] do_async_page_fault+0x14/0x90 [ 3491.195051] [<ffffffff82b14b78>] async_page_fault+0x28/0x30 [ 3491.195051] [<ffffffff81327429>] ? atime_needs_update+0x9/0xc0 [ 3491.195051] [<ffffffff81314922>] trailing_symlink+0x62/0x260 [ 3491.195051] [<ffffffff8131729d>] path_openat+0x2ed/0x1560 [ 3491.195051] [<ffffffff81319549>] do_filp_open+0x79/0xd0 [ 3491.195051] [<ffffffff82b11d92>] ? _raw_spin_unlock+0x22/0x30 [ 3491.195051] [<ffffffff81328f08>] ? __alloc_fd+0xf8/0x200 [ 3491.195051] [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0 [ 3491.195051] [<ffffffff81306fbf>] SyS_openat+0xf/0x20 [ 3491.195051] [<ffffffff82b12836>] entry_SYSCALL_64_fastpath+0x16/0x7a [ 3491.284831] note: syz-executor[17000] exited with preempt_count 1 The warning is this one: static inline int should_follow_link(struct nameidata *nd, struct path *link, int follow, struct inode *inode, unsigned seq) { .... WARN_ON(!inode); // now, _that_ should not happen. return pick_link(nd, link, inode, seq); } -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html