Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes: > On Sun, Apr 02, 2017 at 05:58:41PM -0700, Linus Torvalds wrote: > >> I had to go and double-check that "DCACHE_DIRECTORY_TYPE" is what >> d_can_lookup() actually checks, so _that_ part is perhaps a bit >> subtle, and might be worth noting in that comment that you edited. >> >> So the real "rule" ends up being that we only ever look up things from >> dentries of type DCACHE_DIRECTORY_TYPE set, and those had better have >> DCACHE_RCUACCESS bit set. >> >> And the only reason path_init() only checks it for that case is that >> nd->root and nd->pwd both have to be of type d_can_lookup(). >> >> Do we check that when we set it? I hope/assume we do. > > For chdir()/chroot()/pivot_root() it's done by LOOKUP_DIRECTORY in lookup > flags; fchdir() is slightly different - there we check S_ISDIR of inode > of opened file. Which is almost the same thing, except for > kinda-sorta directories that have no ->lookup() - we have them for > NFS referral points. It should be impossible to end up with > one of those opened - not even with O_PATH; follow_managed() will be called > and we'll either fail or cross into whatever ends up overmounting them. > Still, it might be cleaner to turn that check into > d_can_lookup(f.file->f_path.dentry) > simply for consistency sake. > > The thing I really don't like is mntns_install(). With sufficiently > nasty nfsroot setup it might be possible to end up with referral point > as one's root/pwd; getting out of such state would be interesting... > Smells like that place should be a solitary follow_down(), not a loop > of follow_down_one(), but I want Eric's opinion on that one; userns stuff > is weird. If I read the conversation correctly the concern is that we might initialize a pwd or root with something that is almost but not quite a directory in mntns_install. Refereshing my memory. d_automount mounts things and is what is used for nfs referrals. d_manage blocks waiting for an automounts to complete or expire. follow_down just calls d_manage, follow_manage calls both d_manage and d_automount as appropriate. If the concern is nfs referral points calling follow_down is wrong and what is wanted is follow_managed. The only thing that follow_down prevents is changing onto directories that are only half mounted, and not really directories yet. Which is certainly part of the invarient we want to preserve. The intent of the logic in mntns_install is to just pick a reasonable looking place somewhere in that mount namespace to use as a root directory. I arbitrarily picked the top of the mount stack on "/". Which is typically used as the root directory. If people really care where their root is they save a directory file descriptor off somewhere and call chroot. So there is a little wiggle room exactly what the code does. There is a secondary use of mntns_install which is to give you a way to access what is under "/" if you are so foolish as to umount "/". I keep thinking setns to your own mount namespace would be a handy way to get back to the rootfs and to use it for something during system shutdown. I don't know if anyone has actually used setns to your own mount namespace for that. The worst case I can see from the proposed change is we would not be able to umount all of the way down to rootfs. That would be a self inflicted wound so I don't care. I can't imagine anyone mounting an automount point deliberately on / except as way to confuse the vfs. Though I can almost imagine getting there by accident if an automount expires. So yes please let's change the follow_down_one loop to follow_managed to preserve the invariant that we always have a directory that supports d_can_lookup to pass to set_fs_pwd and set_fs_root. Eric > diff --git a/fs/dcache.c b/fs/dcache.c > index 95d71eda8142..05550139a8a6 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -1757,7 +1757,13 @@ static unsigned d_flags_for_inode(struct inode *inode) > return DCACHE_MISS_TYPE; > > if (S_ISDIR(inode->i_mode)) { > - add_flags = DCACHE_DIRECTORY_TYPE; > + /* > + * Any potential starting point of lookup should have > + * DCACHE_RCUACCESS; currently directory dentries > + * come from d_alloc() anyway, but it costs us nothing > + * to enforce it here. > + */ > + add_flags = DCACHE_DIRECTORY_TYPE | DCACHE_RCUACCESS; > if (unlikely(!(inode->i_opflags & IOP_LOOKUP))) { > if (unlikely(!inode->i_op->lookup)) > add_flags = DCACHE_AUTODIR_TYPE; > diff --git a/fs/namei.c b/fs/namei.c > index d41fab78798b..19dcf62133cc 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -2145,6 +2145,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > int retval = 0; > const char *s = nd->name->name; > > + if (!*s) > + flags &= ~LOOKUP_RCU; > + > nd->last_type = LAST_ROOT; /* if there are only slashes... */ > nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; > nd->depth = 0; > diff --git a/fs/namespace.c b/fs/namespace.c > index cc1375eff88c..31ec9a79d2d4 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -3467,6 +3467,7 @@ static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns) > struct fs_struct *fs = current->fs; > struct mnt_namespace *mnt_ns = to_mnt_ns(ns); > struct path root; > + int err; > > if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) || > !ns_capable(current_user_ns(), CAP_SYS_CHROOT) || > @@ -3484,15 +3485,14 @@ static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns) > root.mnt = &mnt_ns->root->mnt; > root.dentry = mnt_ns->root->mnt.mnt_root; > path_get(&root); > - while(d_mountpoint(root.dentry) && follow_down_one(&root)) > - ; > - > - /* Update the pwd and root */ > - set_fs_pwd(fs, &root); > - set_fs_root(fs, &root); > - > + err = follow_down(&root); > + if (likely(!err)) { > + /* Update the pwd and root */ > + set_fs_pwd(fs, &root); > + set_fs_root(fs, &root); > + } > path_put(&root); > - return 0; > + return err; > } > > static struct user_namespace *mntns_owner(struct ns_common *ns) > diff --git a/fs/open.c b/fs/open.c > index 949cef29c3bb..217b5db588c8 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -459,20 +459,17 @@ SYSCALL_DEFINE1(chdir, const char __user *, filename) > SYSCALL_DEFINE1(fchdir, unsigned int, fd) > { > struct fd f = fdget_raw(fd); > - struct inode *inode; > int error = -EBADF; > > error = -EBADF; > if (!f.file) > goto out; > > - inode = file_inode(f.file); > - > error = -ENOTDIR; > - if (!S_ISDIR(inode->i_mode)) > + if (!d_can_lookup(f.file->f_path.dentry)) > goto out_putf; > > - error = inode_permission(inode, MAY_EXEC | MAY_CHDIR); > + error = inode_permission(file_inode(f.file), MAY_EXEC | MAY_CHDIR); > if (!error) > set_fs_pwd(current->fs, &f.file->f_path); > out_putf: