On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@xxxxxxxxxxxxx> wrote: > > shiftfs mounts cannot be nested for two reasons -- global > CAP_SYS_ADMIN is required to set up a mark mount, and a single > functional shiftfs mount meets the filesystem stacking depth > limit. > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel > ids in a mount must be within that mount's s_user_ns, so all that > is needed is CAP_SYS_ADMIN within that s_user_ns. > > The stack depth issue can be worked around with a couple of > adjustments. First, a mark mount doesn't really need to count > against the stacking depth as it doesn't contribute to the call > stack depth during filesystem operations. Therefore the mount > over the mark mount only needs to count as one more than the > lower filesystems stack depth. That's true, but it also highlights the point that the "mark" sb is completely unneeded and it really is just a nice little hack. All the information it really stores is a lower mount reference, a lower root dentry and a declarative statement "I am shiftable!". Come to think about it, "container shiftable" really isn't that different from NFS export with "no_root_squash" and auto mounted USB drive. I mean the shifting itself is different of course, but the declaration, not so much. If I am allowing sudoers on another machine to mess with root owned files visible on my machine, I am pretty much have the same issues as container admins accessing root owned files on my init_user_ns filesystem. In all those cases, I'd better not be executing suid binaries from the untrusted "external" source. Instead of mounting a dummy filesystem to make the declaration, you could get the same thing with: mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC) and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED or whatnot) constant to uapi and manage to come up good man page description. Then users could actually mount a filesystem in init_user_ns MS_EXTERN and avoid the extra bind mount step (for a full filesystem tree export). Declaring a mounted image MS_EXTERN has merits on its own even without containers and shitfs, for example with pluggable storage. Other LSMs could make good use of that declaration. > > Second, when the lower mount is shiftfs we can just skip down to > that mount's lower filesystem and shift ids relative to that. > There is no reason to shift ids twice, and the lower path has > already been marked safe for id shifting by a user privileged > towards all ids in that mount's user ns. > > Signed-off-by: Seth Forshee <seth.forshee@xxxxxxxxxxxxx> > --- > fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++----------------- > 1 file changed, 46 insertions(+), 22 deletions(-) > > diff --git a/fs/shiftfs.c b/fs/shiftfs.c > index b19af7b2fe75..008ace2842b9 100644 > --- a/fs/shiftfs.c > +++ b/fs/shiftfs.c > @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data, > struct shiftfs_data *data = raw_data; > char *name = kstrdup(data->path, GFP_KERNEL); > int err = -ENOMEM; > - struct shiftfs_super_info *ssi = NULL; > + struct shiftfs_super_info *ssi = NULL, *mp_ssi; > struct path path; > struct dentry *dentry; > > @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data, > if (err) > goto out; > > - /* to mark a mount point, must be real root */ > - if (ssi->mark && !capable(CAP_SYS_ADMIN)) > - goto out; > - > - /* else to mount a mark, must be userns admin */ > + /* to mount a mark, must be userns admin */ > if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN)) > goto out; Isn't this check performed by vfs anyway? i.e. in mount_nodev() -> sget() > > @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data, > > if (!S_ISDIR(path.dentry->d_inode->i_mode)) { > err = -ENOTDIR; > - goto out_put; > - } > - > - sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1; > - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) { > - printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n"); > - err = -EINVAL; > - goto out_put; > + goto out_put_path; > } > > if (ssi->mark) { > + struct super_block *lower_sb = path.mnt->mnt_sb; > + > + /* to mark a mount point, must root wrt lower s_user_ns */ > + if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN)) > + goto out_put_path; > + > + > /* > * this part is visible unshifted, so make sure no > * executables that could be used to give suid > * privileges > */ > sb->s_iflags = SB_I_NOEXEC; As commented on cover letter, why allow access to any files besides root at all? In fact, the only justification for a dummy sb (instead of bind mount with MS_EXTERN flag) would be in order to override inode operations with noop ops to prevent access to unshifted files from within container. Thanks, Amir.