Re: Regression in 5.3 for some FS_USERNS_MOUNT (aka user-namespace-mountable) filesystems

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Sat, 27 Jul 2019 13:37:05 +0100

On Sat, Jul 27, 2019 at 06:20:18AM -0500, Eric W. Biederman wrote:

> > In principle I like killing FS_USERNS_MOUNT flag, but when a method
> > is always either NULL or exact same function...
> 
> Either you are being dramatic or you read the patch much too quickly.

Or you've done the same to my reply.  I'm not saying that in the
posted form the instances are the same; I'm saying that they are
all equivalent to exact same code - ns_capable(fc->user_ns, CAP_SYS_ADMIN),
that is.

> userns_mount_permission covers the common case of FS_USERNS_MOUNT.
> Then there are the cases where you need to know how the filesystem is
> going to map current into the filesystem that will be mounted.  Those
> are: proc_mount_permission, sysfs_mount_permission,
> mqueue_mount_permission, cgroup_mount_permission,

+static int proc_mount_permission(void)
+{
+       struct pid_namespace *ns = task_active_pid_ns(current);
+       return ns_capable(ns->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}

compare with
        ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
        put_user_ns(fc->user_ns);
        fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
in proc_init_fs_context().  Notice anything?  With those 'orrible
API changes proc_mount_permissions() and userns_mount_permission()
can actually become identical.  Because the default case is
to leave fc->user_ns set to current_user_ns().

cgroup case:
+static int cgroup_mount_permission(void)
+{
+       struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+       return ns_capable(ns->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}

with
        ctx->ns = current->nsproxy->cgroup_ns;
...
        put_user_ns(fc->user_ns);
        fc->user_ns = get_user_ns(ctx->ns->user_ns);
in cgroup_init_fs_context().  IOW, again your instances fold together.

mqueue:
        ctx->ipc_ns = get_ipc_ns(current->nsproxy->ipc_ns);
        put_user_ns(fc->user_ns);
        fc->user_ns = get_user_ns(ctx->ipc_ns->user_ns);
in mqueue_init_fs_context() and
+static int mqueue_mount_permission(void)
+{
+       struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+       return ns_capable(ns->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}

Same situation.

+static int sysfs_mount_permission(void)
+{
+       struct net *net = current->nsproxy->net_ns;
+       return ns_capable(net->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}

with
        kfc->ns_tag = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
...
        if (netns) {
                put_user_ns(fc->user_ns);
                fc->user_ns = get_user_ns(netns->user_ns);
        }
Now, _that_ is interesting.  Here the variants diverge - in case
USER_NS && !NET_NS fc->user_ns is left at current_user_ns().

That, BTW, is the only case where the old checks are left in:
        if (!(fc->sb_flags & SB_KERNMOUNT)) {
                if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
                        return -EPERM;
        }
in sysfs_init_fs_context() papers over that case and I'd love to
get rid of that irregularity.  Which, AFAICS, can be done by
unconditional
	put_user_ns(fc->user_ns);
	fc->user_ns = get_user_ns(current->nsproxy->net_ns->user_ns);
whatever the .config we have.  On NET_NS ones it'll be identical
to ->user_ns of what kobj_ns_grab_current(KOBJ_NS_TYPE_NET) returns,
on !NET_NS it'll get mount_capable() do the right thing without
that wart with init_fs_context() doing that extra check.

> So yes I agree the function of interest is always capable in some form,
> we just need the filesystem specific logic to check to see if we will
> have capable over the filesystem that will be mounted.
> 
> I don't doubt that the new mount api has added a few new complexities.

So far it looks like *in this particular case* complexities would be
reduced - with one exception all your ->permission() instances become
identical.

Moreover, even in that case we still get the right overall behaviour
with the same instance...

So do you have any specific objections to behaviour in vfs.git #fixes
and/or to adding

diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index db81cfbab9d6..c5b3c7d4d360 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -57,11 +57,6 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 	struct kernfs_fs_context *kfc;
 	struct net *netns;
 
-	if (!(fc->sb_flags & SB_KERNMOUNT)) {
-		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
-			return -EPERM;
-	}
-
 	kfc = kzalloc(sizeof(struct kernfs_fs_context), GFP_KERNEL);
 	if (!kfc)
 		return -ENOMEM;
@@ -71,10 +66,8 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 	kfc->magic = SYSFS_MAGIC;
 	fc->fs_private = kfc;
 	fc->ops = &sysfs_fs_context_ops;
-	if (netns) {
-		put_user_ns(fc->user_ns);
-		fc->user_ns = get_user_ns(netns->user_ns);
-	}
+	put_user_ns(fc->user_ns);
+	fc->user_ns = get_user_ns(current->nsproxy->net_ns->user_ns);
 	fc->global = true;
 	return 0;
 }

on top of that?  The last one shouldn't change the behaviour, but it
certainly looks like a nice wartectomy; not #fixes material, though.