Re: fs: NULL deref in atime_needs_update

Dmitry Vyukov <dvyukov@xxxxxxxxxx> · Wed, 24 Feb 2016 11:15:20 +0100

On Wed, Feb 24, 2016 at 11:03 AM, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> On Wed, Feb 24, 2016 at 5:46 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>> On Wed, Feb 24, 2016 at 11:12:13AM +0800, Ian Kent wrote:
>>> On Wed, 2016-02-17 at 00:40 +0100, Mickaël Salaün wrote:
>>> > Hi,
>>> >
>>> > Actually I found the same bug (without fuzzing) and I can reproduce it
>>> > in a deterministic way (e.g. by creating a LSM that return 1 for the
>>> > security_file_open hook). At least, from v4.2.8 I can easily trigger
>>> > traces like this :
>>>
>>> Reading through this thread I wonder if this is a new problem.
>>>
>>> Is there a previous kernel it can't be reproduced on?
>>> Perhaps a bisect will shed some light on what's happening.
>>
>> There are several things in the mix.  What Mickaël has found is that a bunch
>> of places where _positive_ number returned instead of expected 0 or -E... can
>> get propagated all way back to caller of do_last(), confusing the hell out
>> of it.  That's not what Dmitry has triggered, though.  WARN_ON() in the
>> end of do_last() would've triggered, and IMO this one, along with mitigation
>> (map that "error value" to -EINVAL) should go into mainline and all -stable.
>> Bogus ->open() returning a positive number had always been bad news; in the
>> best case it would be returned to userland, leading to "open(2) has failed
>> and returned a positive number".  Hell knows if we ever had such instances
>> (or have them right now), but I wouldn't bet on their absense.  Rare
>> failure exits returning bogus values in an ->open() instance in some driver
>> can easily stay unnoticed for a long time.  Starting from at least 3.6
>> (circa the atomic_open support) it got more unpleasant than simple "confuse
>> the hell out of userland".  ->open() isn't the only vector for injection of
>> such crap - ->permission() would also serve, same for several LSM turds, etc.
>>
>> Again, that's a separate problem.  What Dmitry seems to be catching is getting
>> crap values fed to should_follow_link() as inode.  I see one bug in that
>> area that does need fixing (fix present in the last patch I've posted, with
>> WARN_ON() to indicate that this thing has triggered; _that_ WARN_ON() should
>> be gone from the final variant, since this can trigger without driver bugs,
>> etc.)  In RCU mode after we'd checked that dentry is a symlink one, we need
>> to verify that it hadn't been changed since before we'd fetched the inode.
>> It might have been e.g. a regular file, which got unlinked with symlink
>> created in its place.  Then we'd go into get_link() with non-symlink inode
>> and oops on attempt to call its ->i_op->get_link().  That race is real, very
>> hard to hit (you need both the unlink(2) and symlink(2) to happen between
>> lookup_fast() and should_follow_link() and unless you have PREEMPT_RCU you
>> can't lose the timeslice there) and would've manifested differently.
>>
>> But that leaves other two kinds of bugs getting triggered: inode of some
>> non-symlink is possible, but what we saw included NULL inode when we'd
>> reached finish_open: in do_last().  Should be flat-out impossible - we either
>> get lookup_fast(..., &inode, ...) return 0 and store NULL in inode, or
>> get NULL inode from pinned d_is_symlink() dentry after having grabbed
>> a reference and left RCU mode.  Neither should be possible without either
>> something weird happening to lookup_fast() (and we would've seen oopsen in
>> link_path_walk() if that could happen, BTW) or screwed dentry refcounting
>> somewhere, combined with a race that managed to turn...
>>
>> Oh, shit.  No screwed refcounting is needed.  Look:
>>         BUG_ON(nd->flags & LOOKUP_RCU);
>>         inode = d_backing_inode(path.dentry);
>>         seq = 0;        /* out of RCU mode, so the value doesn't matter */
>>         if (unlikely(d_is_negative(path.dentry))) {
>>                 path_to_nameidata(&path, nd);
>>                 return -ENOENT;
>>         }
>> Suppose we come here with negative path.dentry.  We are holding a reference,
>> all right, and for a _postive_ dentry that would've been enough to keep
>> it positive.  Not so for a negative one, though - symlink(2) on another
>> CPU doint d_instantiate() just before the d_is_negative() check and we
>> are fucked - inode is stale and we sail through all the checks, all the
>> way into should_follow_link().
>>
>> We also have the same kind of crap in walk_component() -
>>                 err = lookup_slow(nd, &path);
>>                 if (err < 0)
>>                         return err;
>>                 inode = d_backing_inode(path.dentry);
>>                 seq = 0;        /* we are already out of RCU mode */
>>                 err = -ENOENT;
>>                 if (d_is_negative(path.dentry))
>>                         goto out_path_put;
>> There it's much harder to hit, though - we need it not just d_instantiate()
>> overlapping those lines; we need the racing syscall to get from locking
>> the parent to d_instantiate() between the point where lookup_slow() has
>> unlocked the parent and d_is_negative().  And lookup_slow() couldn't have
>> gone into mountpoint crossing, so it's really hard to hit - you pretty
>> much have to get preempted just after fetching inode.
>>
>> OK, the next delta to try, and there definitely are several commits in
>> that pile.  It still does not explain the traces with GPF at 0x50, though -
>> for that we need not just a NULL getting to should_follow_link() but
>> something non-NULL with NULL at offset 40 from it (offset of ->i_sb in
>> struct inode).  That something *can't* be a valid struct inode or had been
>> one in recent past - ->i_sb is assigned in new_inode(), value is always
>> non-NULL and never modified all the way until RCU-delayed freeing of struct
>> inode.  So that has to be something entirely different...  Anyway, the
>> patch so far follows:
>
>
> Restarted testing with this patch (dropped previous patches because
> they conflict).
>
> These "unlikely" scenarios can be more likely inside of VMs where
> effective preemption can happen at random points. Also NMIs probably
> can increase probability of such races.

For now I can only say that I am hitting this one (3 times in 20 mins):

                if (read_seqcount_retry(&link->dentry->d_seq, seq)) {
                        WARN_ON(1);     // just as way to report hitting
                                        // that path; it's OK to get
                                        // here, in the final variant
                                        // WARN_ON will disappear.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html