On Wed, Feb 24, 2016 at 11:03 AM, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote: > On Wed, Feb 24, 2016 at 5:46 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: >> On Wed, Feb 24, 2016 at 11:12:13AM +0800, Ian Kent wrote: >>> On Wed, 2016-02-17 at 00:40 +0100, Mickaël Salaün wrote: >>> > Hi, >>> > >>> > Actually I found the same bug (without fuzzing) and I can reproduce it >>> > in a deterministic way (e.g. by creating a LSM that return 1 for the >>> > security_file_open hook). At least, from v4.2.8 I can easily trigger >>> > traces like this : >>> >>> Reading through this thread I wonder if this is a new problem. >>> >>> Is there a previous kernel it can't be reproduced on? >>> Perhaps a bisect will shed some light on what's happening. >> >> There are several things in the mix. What Mickaël has found is that a bunch >> of places where _positive_ number returned instead of expected 0 or -E... can >> get propagated all way back to caller of do_last(), confusing the hell out >> of it. That's not what Dmitry has triggered, though. WARN_ON() in the >> end of do_last() would've triggered, and IMO this one, along with mitigation >> (map that "error value" to -EINVAL) should go into mainline and all -stable. >> Bogus ->open() returning a positive number had always been bad news; in the >> best case it would be returned to userland, leading to "open(2) has failed >> and returned a positive number". Hell knows if we ever had such instances >> (or have them right now), but I wouldn't bet on their absense. Rare >> failure exits returning bogus values in an ->open() instance in some driver >> can easily stay unnoticed for a long time. Starting from at least 3.6 >> (circa the atomic_open support) it got more unpleasant than simple "confuse >> the hell out of userland". ->open() isn't the only vector for injection of >> such crap - ->permission() would also serve, same for several LSM turds, etc. >> >> Again, that's a separate problem. What Dmitry seems to be catching is getting >> crap values fed to should_follow_link() as inode. I see one bug in that >> area that does need fixing (fix present in the last patch I've posted, with >> WARN_ON() to indicate that this thing has triggered; _that_ WARN_ON() should >> be gone from the final variant, since this can trigger without driver bugs, >> etc.) In RCU mode after we'd checked that dentry is a symlink one, we need >> to verify that it hadn't been changed since before we'd fetched the inode. >> It might have been e.g. a regular file, which got unlinked with symlink >> created in its place. Then we'd go into get_link() with non-symlink inode >> and oops on attempt to call its ->i_op->get_link(). That race is real, very >> hard to hit (you need both the unlink(2) and symlink(2) to happen between >> lookup_fast() and should_follow_link() and unless you have PREEMPT_RCU you >> can't lose the timeslice there) and would've manifested differently. >> >> But that leaves other two kinds of bugs getting triggered: inode of some >> non-symlink is possible, but what we saw included NULL inode when we'd >> reached finish_open: in do_last(). Should be flat-out impossible - we either >> get lookup_fast(..., &inode, ...) return 0 and store NULL in inode, or >> get NULL inode from pinned d_is_symlink() dentry after having grabbed >> a reference and left RCU mode. Neither should be possible without either >> something weird happening to lookup_fast() (and we would've seen oopsen in >> link_path_walk() if that could happen, BTW) or screwed dentry refcounting >> somewhere, combined with a race that managed to turn... >> >> Oh, shit. No screwed refcounting is needed. Look: >> BUG_ON(nd->flags & LOOKUP_RCU); >> inode = d_backing_inode(path.dentry); >> seq = 0; /* out of RCU mode, so the value doesn't matter */ >> if (unlikely(d_is_negative(path.dentry))) { >> path_to_nameidata(&path, nd); >> return -ENOENT; >> } >> Suppose we come here with negative path.dentry. We are holding a reference, >> all right, and for a _postive_ dentry that would've been enough to keep >> it positive. Not so for a negative one, though - symlink(2) on another >> CPU doint d_instantiate() just before the d_is_negative() check and we >> are fucked - inode is stale and we sail through all the checks, all the >> way into should_follow_link(). >> >> We also have the same kind of crap in walk_component() - >> err = lookup_slow(nd, &path); >> if (err < 0) >> return err; >> inode = d_backing_inode(path.dentry); >> seq = 0; /* we are already out of RCU mode */ >> err = -ENOENT; >> if (d_is_negative(path.dentry)) >> goto out_path_put; >> There it's much harder to hit, though - we need it not just d_instantiate() >> overlapping those lines; we need the racing syscall to get from locking >> the parent to d_instantiate() between the point where lookup_slow() has >> unlocked the parent and d_is_negative(). And lookup_slow() couldn't have >> gone into mountpoint crossing, so it's really hard to hit - you pretty >> much have to get preempted just after fetching inode. >> >> OK, the next delta to try, and there definitely are several commits in >> that pile. It still does not explain the traces with GPF at 0x50, though - >> for that we need not just a NULL getting to should_follow_link() but >> something non-NULL with NULL at offset 40 from it (offset of ->i_sb in >> struct inode). That something *can't* be a valid struct inode or had been >> one in recent past - ->i_sb is assigned in new_inode(), value is always >> non-NULL and never modified all the way until RCU-delayed freeing of struct >> inode. So that has to be something entirely different... Anyway, the >> patch so far follows: > > > Restarted testing with this patch (dropped previous patches because > they conflict). > > These "unlikely" scenarios can be more likely inside of VMs where > effective preemption can happen at random points. Also NMIs probably > can increase probability of such races. For now I can only say that I am hitting this one (3 times in 20 mins): if (read_seqcount_retry(&link->dentry->d_seq, seq)) { WARN_ON(1); // just as way to report hitting // that path; it's OK to get // here, in the final variant // WARN_ON will disappear. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html