If a container manager restricts its unprivileged (user-namespaced)
children with a device cgroup, it is no longer necessary to deny mknod().
User space applications can then map devices to different locations in
the file system by calling mknod() inside the container.

One use case, which we also have in GyroidOS, is running virsh for VMs
inside an unprivileged container. virsh creates device nodes such as
"/var/run/libvirt/qemu/11-fgfg.dev/null", which currently fails in a
non-initial userns even if a cgroup device whitelist with the
corresponding major and minor of /dev/null exists. In this case the
usual bind mounts or pre-populated device nodes under /dev are not
sufficient.

Following the discussion with Christian on v2, I agree that the previous
approach was too complex. All we actually want is working device nodes
in a user namespace when a device cgroup is in place to handle access
decisions.

Patch 1 provides a helper function to check whether the current task is
guarded by a bpf device cgroup program. Thanks to Alexander Mikhalitsyn
for reviewing.

Patch 2 implements the ns_capable check, including the sysctl, as
proposed by Christian. The commit message there gives a short overview
of device node creation and access decisions.

Patch 3 provides devguard, a small LSM which actually strips out
SB_I_NODEV.
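For illustration, the failing call described above can be reproduced from user space with a few lines of C. This is only a minimal sketch, not code from the series: the helper names null_devnum() and make_null_node() are made up for this example, and the only assumption about the device is the conventional Linux numbering of /dev/null (character device, major 1, minor 3).

```c
#define _DEFAULT_SOURCE /* for mknod() and S_IFCHR under strict -std= modes */

#include <errno.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical helper: device number of /dev/null (char device 1:3). */
dev_t null_devnum(void)
{
	return makedev(1, 3);
}

/*
 * Hypothetical helper: try to create a /dev/null-equivalent node at an
 * arbitrary path, as virsh does below /var/run/libvirt/qemu/.
 * Returns 0 on success, or the errno value reported by mknod().
 */
int make_null_node(const char *path)
{
	if (mknod(path, S_IFCHR | 0666, null_devnum()) == 0)
		return 0;
	return errno;
}
```

Called inside a non-initial user namespace, make_null_node() is expected to return an error today (typically EPERM), regardless of what the device cgroup allows. The intent of this series is that the same call can succeed when a device cgroup program guards the task and permits the 1:3 character device.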
---
Changes in v3:
- Small LSM to just implement security_inode_mknod() hook
- Leave devcgroup as is
- Strip SB_I_NODEV in security_inode_mknod hook as suggested by Christian
- Do not change bpf or cgroup access decision at all
- ns_capable(sb->s_user_ns, CAP_MKNOD) in vfs_mknod()
- Link to v2: https://lore.kernel.org/lkml/1d481e11-6601-4b82-a317-f8506f3ccf9b@xxxxxxxxxxxxxxxxxxx/

Changes in v2:
- Integrate this as LSM (Christian, Paul)
- Switched to a device cgroup specific flag instead of a generic
  bpf program flag (Christian)
- Do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in
  sb_alloc_super in fs/super.c
- Link to v1: https://lore.kernel.org/lkml/20230814-devcg_guard-v1-0-654971ab88b1@xxxxxxxxxxxxxxxxxxx

Michael Weiß (3):
  bpf: cgroup: Introduce helper cgroup_bpf_current_enabled()
  fs: Make vfs_mknod() to check CAP_MKNOD in user namespace of sb
  devguard: added device guard for mknod in non-initial userns

 fs/namei.c                   | 30 +++++++++++++++++++++++-
 include/linux/bpf-cgroup.h   |  2 ++
 kernel/bpf/cgroup.c          | 14 ++++++++++++
 security/Kconfig             | 11 +++++----
 security/Makefile            |  1 +
 security/devguard/Kconfig    | 12 ++++++++++
 security/devguard/Makefile   |  2 ++
 security/devguard/devguard.c | 44 ++++++++++++++++++++++++++++++++++++
 8 files changed, 110 insertions(+), 6 deletions(-)
 create mode 100644 security/devguard/Kconfig
 create mode 100644 security/devguard/Makefile
 create mode 100644 security/devguard/devguard.c

base-commit: a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
-- 
2.30.2