On Sun, Nov 15, 2020 at 5:39 AM Christian Brauner <christian.brauner@xxxxxxxxxx> wrote:
> When interacting with user namespace and non-user namespace aware
> filesystem capabilities the vfs will perform various security checks to
> determine whether or not the filesystem capabilities can be used by the
> caller (e.g. during exec), or even whether they need to be removed. The
> main infrastructure for this resides in the capability codepaths but they
> are called through the LSM security infrastructure even though they are not
> technically an LSM or optional. This extends the existing security hooks
> security_inode_removexattr(), security_inode_killpriv(),
> security_inode_getsecurity() to pass down the mount's user namespace and
> makes them aware of idmapped mounts.
>
> In order to actually get filesystem capabilities from disk the capability
> infrastructure exposes the get_vfs_caps_from_disk() helper. For user
> namespace aware filesystem capabilities a root uid is stored alongside the
> capabilities.
>
> In order to determine whether the caller can make use of the filesystem
> capability or whether it needs to be ignored it is translated according to
> the superblock's user namespace. If it can be translated to uid 0 according
> to that id mapping the caller can use the filesystem capabilities stored on
> disk. If we are accessing the inode that holds the filesystem capabilities
> through an idmapped mount we need to map the root uid according to the
> mount's user namespace.
>
> Afterwards the checks are identical to non-idmapped mounts. Reading
> filesystem caps from disk enforces that the root uid associated with the
> filesystem capability must have a mapping in the superblock's user
> namespace and that the caller is either in the same user namespace or is a
> descendant of the superblock's user namespace. For filesystems that are
> mountable inside user namespace the container can just mount the filesystem
> and won't usually need to idmap it. If it does create an idmapped mount it
> can mark it with a user namespace it has created and which is therefore a
> descendant of the s_user_ns. For filesystems that are not mountable inside
> user namespaces the descendant rule is trivially true because the s_user_ns
> will be the initial user namespace.
>
> If the initial user namespace is passed all operations are a nop so
> non-idmapped mounts will not see a change in behavior and will also not see
> any performance impact.
>
> Cc: Christoph Hellwig <hch@xxxxxx>
> Cc: David Howells <dhowells@xxxxxxxxxx>
> Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: linux-fsdevel@xxxxxxxxxxxxxxx
> Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx>

...

> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 8dba8f0983b5..ddb9213a3e81 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -1944,7 +1944,7 @@ static inline int audit_copy_fcaps(struct audit_names *name,
>         if (!dentry)
>                 return 0;
>
> -       rc = get_vfs_caps_from_disk(dentry, &caps);
> +       rc = get_vfs_caps_from_disk(&init_user_ns, dentry, &caps);
>         if (rc)
>                 return rc;
>
> @@ -2495,7 +2495,8 @@ int __audit_log_bprm_fcaps(struct linux_binprm *bprm,
>         ax->d.next = context->aux;
>         context->aux = (void *)ax;
>
> -       get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> +       get_vfs_caps_from_disk(mnt_user_ns(bprm->file->f_path.mnt),
> +                              bprm->file->f_path.dentry, &vcaps);

As audit currently records information in the context of the
initial/host namespace, I'm guessing we don't want the mnt_user_ns()
call above; it seems like &init_user_ns would be the right choice
(similar to audit_copy_fcaps()), yes?

>         ax->fcap.permitted = vcaps.permitted;
>         ax->fcap.inheritable = vcaps.inheritable;

--
paul moore
www.paul-moore.com
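
For anyone following along, here is a rough userspace sketch of why the
namespace argument matters for the question above. It is not kernel code
and does not reproduce get_vfs_caps_from_disk(); every toy_* name and the
single-extent map layout are made up for illustration. It only models the
two points from the commit message: an identity (initial namespace)
mapping is a nop, and whether a stored root uid "translates to uid 0"
depends on which mapping is applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "no mapping exists" in this toy model. */
#define TOY_INVALID_ID UINT32_MAX

/*
 * Hypothetical single-extent id map: ids [first, first + count) as seen
 * inside the namespace correspond to [lower_first, lower_first + count)
 * outside of it. Real kernel maps hold several extents.
 */
struct toy_idmap {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Identity map standing in for the initial user namespace. */
static const struct toy_idmap toy_init_map = { 0, 0, UINT32_MAX };

/* Translate a namespace-relative id to the id outside the namespace. */
static uint32_t toy_map_down(const struct toy_idmap *map, uint32_t id)
{
	assert(map != NULL);
	if (id < map->first || id - map->first >= map->count)
		return TOY_INVALID_ID;
	return map->lower_first + (id - map->first);
}

/* Translate an outside id to the id seen inside the namespace. */
static uint32_t toy_map_up(const struct toy_idmap *map, uint32_t id)
{
	assert(map != NULL);
	if (id < map->lower_first || id - map->lower_first >= map->count)
		return TOY_INVALID_ID;
	return map->first + (id - map->lower_first);
}

/* Does the stored root uid appear as uid 0 under the given map? */
static bool toy_rootid_maps_to_zero(const struct toy_idmap *map,
				    uint32_t rootid)
{
	return toy_map_up(map, rootid) == 0;
}
```

With a container-style map of { 0, 100000, 65536 }, a stored root uid of
100000 maps to 0, while under toy_init_map it stays 100000 and does not.
That difference is the reason the choice between &init_user_ns and
mnt_user_ns() can change what gets recorded.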