On Wed, Feb 24, 2016 at 11:12:13AM +0800, Ian Kent wrote: > On Wed, 2016-02-17 at 00:40 +0100, Mickaël Salaün wrote: > > Hi, > > > > Actually I found the same bug (without fuzzing) and I can reproduce it > > in a deterministic way (e.g. by creating a LSM that return 1 for the > > security_file_open hook). At least, from v4.2.8 I can easily trigger > > traces like this : > > Reading through this thread I wonder if this is a new problem. > > Is there a previous kernel it can't be reproduced on? > Perhaps a bisect will shed some light on what's happening. There are several things in the mix. What Mickaël has found is that a bunch of places where _positive_ number returned instead of expected 0 or -E... can get propagated all way back to caller of do_last(), confusing the hell out of it. That's not what Dmitry has triggered, though. WARN_ON() in the end of do_last() would've triggered, and IMO this one, along with mitigation (map that "error value" to -EINVAL) should go into mainline and all -stable. Bogus ->open() returning a positive number had always been bad news; in the best case it would be returned to userland, leading to "open(2) has failed and returned a positive number". Hell knows if we ever had such instances (or have them right now), but I wouldn't bet on their absense. Rare failure exits returning bogus values in an ->open() instance in some driver can easily stay unnoticed for a long time. Starting from at least 3.6 (circa the atomic_open support) it got more unpleasant than simple "confuse the hell out of userland". ->open() isn't the only vector for injection of such crap - ->permission() would also serve, same for several LSM turds, etc. Again, that's a separate problem. What Dmitry seems to be catching is getting crap values fed to should_follow_link() as inode. I see one bug in that area that does need fixing (fix present in the last patch I've posted, with WARN_ON() to indicate that this thing has triggered; _that_ WARN_ON() should be gone from the final variant, since this can trigger without driver bugs, etc.) In RCU mode after we'd checked that dentry is a symlink one, we need to verify that it hadn't been changed since before we'd fetched the inode. It might have been e.g. a regular file, which got unlinked with symlink created in its place. Then we'd go into get_link() with non-symlink inode and oops on attempt to call its ->i_op->get_link(). That race is real, very hard to hit (you need both the unlink(2) and symlink(2) to happen between lookup_fast() and should_follow_link() and unless you have PREEMPT_RCU you can't lose the timeslice there) and would've manifested differently. But that leaves other two kinds of bugs getting triggered: inode of some non-symlink is possible, but what we saw included NULL inode when we'd reached finish_open: in do_last(). Should be flat-out impossible - we either get lookup_fast(..., &inode, ...) return 0 and store NULL in inode, or get NULL inode from pinned d_is_symlink() dentry after having grabbed a reference and left RCU mode. Neither should be possible without either something weird happening to lookup_fast() (and we would've seen oopsen in link_path_walk() if that could happen, BTW) or screwed dentry refcounting somewhere, combined with a race that managed to turn... Oh, shit. No screwed refcounting is needed. Look: BUG_ON(nd->flags & LOOKUP_RCU); inode = d_backing_inode(path.dentry); seq = 0; /* out of RCU mode, so the value doesn't matter */ if (unlikely(d_is_negative(path.dentry))) { path_to_nameidata(&path, nd); return -ENOENT; } Suppose we come here with negative path.dentry. We are holding a reference, all right, and for a _postive_ dentry that would've been enough to keep it positive. Not so for a negative one, though - symlink(2) on another CPU doint d_instantiate() just before the d_is_negative() check and we are fucked - inode is stale and we sail through all the checks, all the way into should_follow_link(). We also have the same kind of crap in walk_component() - err = lookup_slow(nd, &path); if (err < 0) return err; inode = d_backing_inode(path.dentry); seq = 0; /* we are already out of RCU mode */ err = -ENOENT; if (d_is_negative(path.dentry)) goto out_path_put; There it's much harder to hit, though - we need it not just d_instantiate() overlapping those lines; we need the racing syscall to get from locking the parent to d_instantiate() between the point where lookup_slow() has unlocked the parent and d_is_negative(). And lookup_slow() couldn't have gone into mountpoint crossing, so it's really hard to hit - you pretty much have to get preempted just after fetching inode. OK, the next delta to try, and there definitely are several commits in that pile. It still does not explain the traces with GPF at 0x50, though - for that we need not just a NULL getting to should_follow_link() but something non-NULL with NULL at offset 40 from it (offset of ->i_sb in struct inode). That something *can't* be a valid struct inode or had been one in recent past - ->i_sb is assigned in new_inode(), value is always non-NULL and never modified all the way until RCU-delayed freeing of struct inode. So that has to be something entirely different... Anyway, the patch so far follows: diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c index c6d7d3d..86f81e3 100644 --- a/fs/autofs4/root.c +++ b/fs/autofs4/root.c @@ -323,6 +323,7 @@ static struct dentry *autofs4_mountpoint_changed(struct path *path) struct dentry *new = d_lookup(parent, &dentry->d_name); if (!new) return NULL; + WARN_ON(d_is_negative(new)); ino = autofs4_dentry_ino(new); ino->last_used = jiffies; dput(path->dentry); diff --git a/fs/namei.c b/fs/namei.c index f624d13..a5bcf63 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1209,6 +1209,7 @@ static int follow_managed(struct path *path, struct nameidata *nd) /* Handle an automount point */ if (managed & DCACHE_NEED_AUTOMOUNT) { ret = follow_automount(path, nd, &need_mntput); + WARN_ON(d_is_negative(path->dentry)); if (ret < 0) break; continue; @@ -1613,8 +1614,10 @@ unlazy: path->mnt = mnt; path->dentry = dentry; err = follow_managed(path, nd); - if (likely(!err)) + if (likely(!err)) { *inode = d_backing_inode(path->dentry); + WARN_ON(!inode); + } return err; need_lookup: @@ -1712,6 +1715,17 @@ static inline int should_follow_link(struct nameidata *nd, struct path *link, return 0; if (!follow) return 0; + /* make sure that d_is_symlink above matches inode */ + if (nd->flags & LOOKUP_RCU) { + if (read_seqcount_retry(&link->dentry->d_seq, seq)) { + WARN_ON(1); // just as way to report hitting + // that path; it's OK to get + // here, in the final variant + // WARN_ON will disappear. + return -ECHILD; + } + } + WARN_ON(!inode); // now, _that_ should not happen. return pick_link(nd, link, inode, seq); } @@ -1743,11 +1757,11 @@ static int walk_component(struct nameidata *nd, int flags) if (err < 0) return err; - inode = d_backing_inode(path.dentry); seq = 0; /* we are already out of RCU mode */ err = -ENOENT; if (d_is_negative(path.dentry)) goto out_path_put; + inode = d_backing_inode(path.dentry); } if (flags & WALK_PUT) @@ -3192,12 +3206,12 @@ retry_lookup: return error; BUG_ON(nd->flags & LOOKUP_RCU); - inode = d_backing_inode(path.dentry); seq = 0; /* out of RCU mode, so the value doesn't matter */ if (unlikely(d_is_negative(path.dentry))) { path_to_nameidata(&path, nd); return -ENOENT; } + inode = d_backing_inode(path.dentry); finish_lookup: if (nd->depth) put_link(nd); @@ -3206,11 +3220,6 @@ finish_lookup: if (unlikely(error)) return error; - if (unlikely(d_is_symlink(path.dentry)) && !(open_flag & O_PATH)) { - path_to_nameidata(&path, nd); - return -ELOOP; - } - if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path.mnt) { path_to_nameidata(&path, nd); } else { @@ -3229,6 +3238,10 @@ finish_open: return error; } audit_inode(nd->name, nd->path.dentry, 0); + if (unlikely(d_is_symlink(nd->path.dentry)) && !(open_flag & O_PATH)) { + error = -ELOOP; + goto out; + } error = -EISDIR; if ((open_flag & O_CREAT) && d_is_dir(nd->path.dentry)) goto out; @@ -3273,6 +3286,10 @@ opened: goto exit_fput; } out: + if (unlikely(error > 0)) { + WARN_ON(1); + error = -EINVAL; + } if (got_write) mnt_drop_write(nd->path.mnt); path_put(&save_parent); -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html