Check CAP_MKNOD against the user namespace of the sb with ns_capable()
in fs/namei.c. This will allow LSM-based guarding of device node
creation in a non-initial user namespace by stripping out SB_I_NODEV
for mounts in its own namespace.

Currently, device access is blocked unconditionally in may_open_dev(),
and mounts inside unprivileged user namespaces get SB_I_NODEV set in
sb->s_iflags, causing open() to fail with -EACCES.

Device access by cgroups is mediated in the following places:

1) fs/namei.c:
   inode_permission() -> devcgroup_inode_permission
   vfs_mknod() -> devcgroup_inode_mknod
2) block/bdev.c:
   blkdev_get_by_dev() -> devcgroup_check_permission
3) drivers/gpu/drm/amd/amdkfd/kfd_priv.h:
   kfd_devcgroup_check_permission -> devcgroup_check_permission

We leave this all in place. However, an LSM can now implement the
security hook security_inode_mknod(), which is called directly after
devcgroup_inode_mknod() in vfs_mknod(), and remove SB_I_NODEV there.
This will let the call to may_open_dev() during open() succeed.

Turning the check from capable(CAP_MKNOD) into
ns_capable(sb->s_user_ns, CAP_MKNOD) is inherently safe due to
SB_I_NODEV. However, this may allow creating device nodes which then
cannot be opened. To give user space some time to adapt, we introduce
a sysctl knob which must be explicitly set to "1" to activate the use
of ns_capable(). Otherwise, we just check the global capability for
the current task as before.

I tested this approach in a GyroidOS container using the small
devguard LSM of the follow-up commit.

Signed-off-by: Michael Weiß <michael.weiss@xxxxxxxxxxxxxxxxxxx>
---
 fs/namei.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 71c13b2990b4..cc61545e02ce 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1032,6 +1032,7 @@ static int sysctl_protected_symlinks __read_mostly;
 static int sysctl_protected_hardlinks __read_mostly;
 static int sysctl_protected_fifos __read_mostly;
 static int sysctl_protected_regular __read_mostly;
+static int sysctl_nscap_mknod __read_mostly;
 
 #ifdef CONFIG_SYSCTL
 static struct ctl_table namei_sysctls[] = {
@@ -1071,6 +1072,15 @@ static struct ctl_table namei_sysctls[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_TWO,
 	},
+	{
+		.procname	= "nscap_mknod",
+		.data		= &sysctl_nscap_mknod,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{ }
 };
 
@@ -3940,6 +3950,24 @@ inline struct dentry *user_path_create(int dfd, const char __user *pathname,
 }
 EXPORT_SYMBOL(user_path_create);
 
+/**
+ * sb_mknod_capable - check userns of sb for CAP_MKNOD
+ * @sb: super block to which userns CAP_MKNOD should be checked
+ *
+ * Check userns of sb for CAP_MKNOD
+ *
+ * Check CAP_MKNOD for owning user namespace of sb if corresponding sysctl is set.
+ * Otherwise just check global capability for current task. This allows
+ * lsm-based guarding of device node creation in non-initial user namespace.
+ */
+static bool sb_mknod_capable(struct super_block *sb)
+{
+	struct user_namespace *user_ns;
+
+	user_ns = sysctl_nscap_mknod ? sb->s_user_ns : &init_user_ns;
+	return ns_capable(user_ns, CAP_MKNOD);
+}
+
 /**
  * vfs_mknod - create device node or file
  * @idmap:	idmap of the mount the inode was found from
@@ -3966,7 +3994,7 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 		return error;
 
 	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
-	    !capable(CAP_MKNOD))
+	    !sb_mknod_capable(dentry->d_sb))
 		return -EPERM;
 
 	if (!dir->i_op->mknod)
-- 
2.30.2
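
To reason about the effect of the knob outside the kernel, the decision
made by sb_mknod_capable() can be modelled in user space. The following
is only a sketch under stated assumptions: struct model_userns, struct
model_task, model_ns_capable(), model_sb_mknod_capable() and
model_demo() are hypothetical names that do not exist in the kernel, and
the capability walk is a deliberate simplification of ns_capable() (it
ignores the namespace-owner rule and any LSM involvement).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace: only the parent link. */
struct model_userns {
	const struct model_userns *parent;	/* NULL for the initial namespace */
};

/* Hypothetical stand-in for a task: the namespace in which it holds
 * CAP_MKNOD, or NULL if it does not hold the capability at all. */
struct model_task {
	const struct model_userns *cap_mknod_ns;
};

/* Simplified model of ns_capable(): a capability held in a namespace also
 * applies to every descendant of that namespace, but never to an ancestor. */
static bool model_ns_capable(const struct model_task *task,
			     const struct model_userns *target)
{
	const struct model_userns *ns;

	if (!task->cap_mknod_ns)
		return false;

	/* Walk from the target up towards the initial namespace. */
	for (ns = target; ns; ns = ns->parent)
		if (ns == task->cap_mknod_ns)
			return true;

	return false;
}

/* Model of the sysctl-gated choice: with the knob at 0 the check is made
 * against the initial namespace, with the knob at 1 against the namespace
 * that owns the superblock. */
static bool model_sb_mknod_capable(const struct model_task *task,
				   const struct model_userns *sb_user_ns,
				   const struct model_userns *init_ns,
				   int nscap_mknod)
{
	const struct model_userns *target = nscap_mknod ? sb_user_ns : init_ns;

	return model_ns_capable(task, target);
}

/* Usage example: a container root versus a host root, both acting on a
 * superblock owned by the container's user namespace. */
static void model_demo(void)
{
	struct model_userns init_ns = { NULL };
	struct model_userns child_ns = { &init_ns };
	struct model_task container_root = { &child_ns };
	struct model_task host_root = { &init_ns };

	/* Knob off: only the globally capable task passes. */
	assert(!model_sb_mknod_capable(&container_root, &child_ns, &init_ns, 0));
	assert(model_sb_mknod_capable(&host_root, &child_ns, &init_ns, 0));

	/* Knob on: the container root passes for its own superblock. */
	assert(model_sb_mknod_capable(&container_root, &child_ns, &init_ns, 1));
	assert(model_sb_mknod_capable(&host_root, &child_ns, &init_ns, 1));
}
```

In this model, enabling the knob only changes the result for the task
that holds CAP_MKNOD in the namespace owning the superblock; whether a
node created this way can then be opened is still decided separately by
SB_I_NODEV, as described in the commit message.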