Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD, which
marks a cgroup device program as a device guard. A device guard can be
used to guard actions on device nodes, e.g., mknod, in a non-initial
userns.

If a container manager restricts its unprivileged (user namespaced)
children with a device cgroup, it is no longer necessary to deny mknod.
User space applications can then map devices to different locations in
the file system by calling mknod() inside the container.

One use case, which we also rely on in GyroidOS, is running virsh for
VMs inside an unprivileged container. virsh creates device nodes, e.g.,
"/var/run/libvirt/qemu/11-fgfg.dev/null", which currently fails in a
non-initial userns even if a cgroup device white list with the
corresponding major and minor of /dev/null exists. In this case the
usual bind mounts or pre-populated device nodes under /dev are not
sufficient.

To lift this limitation, we allow mknod() in the VFS if a bpf cgroup
device guard is enabled for the current task, and check CAP_MKNOD
against the current user namespace instead of the init userns. To avoid
unusable device nodes on file systems mounted in a non-initial user
namespace, may_open_dev() ignores SB_I_NODEV for cgroup device guarded
tasks.

Tested with a GyroidOS container generated by the cmld, using the
following user space patch:
https://github.com/gyroidos/cml/pull/394

I discussed this internally with Christian in the UAPI group earlier.
I am posting it to the public list now since the LXC/LXD folks have
also expressed interest in this.

This series applies to the latest mainline v6.5-rc6 tag.
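For illustration only, the white list decision for /dev/null described
above can be modeled in plain user space C. This is a sketch under
stated assumptions, not code from this series: struct dev_ctx mirrors
the layout of struct bpf_cgroup_dev_ctx, the DEVCG_* constants are local
copies of the BPF_DEVCG_* values from include/uapi/linux/bpf.h, and
dev_rule, allow_list, make_ctx() and dev_allowed() are hypothetical
names chosen for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the BPF_DEVCG_* values (names are illustrative only). */
enum { DEVCG_ACC_MKNOD = 1 << 0, DEVCG_ACC_READ = 1 << 1, DEVCG_ACC_WRITE = 1 << 2 };
enum { DEVCG_DEV_BLOCK = 1 << 0, DEVCG_DEV_CHAR = 1 << 1 };

/* Same field layout as struct bpf_cgroup_dev_ctx. */
struct dev_ctx {
	uint32_t access_type;	/* (access << 16) | device type */
	uint32_t major;
	uint32_t minor;
};

/* Hypothetical white list entry: one device and the accesses it permits. */
struct dev_rule {
	uint32_t type;
	uint32_t major;
	uint32_t minor;
	uint32_t access;
};

/* Only /dev/null (char 1:3) is listed, including mknod. */
static const struct dev_rule allow_list[] = {
	{ DEVCG_DEV_CHAR, 1, 3, DEVCG_ACC_MKNOD | DEVCG_ACC_READ | DEVCG_ACC_WRITE },
};

/* Pack a request the way the kernel hands it to a cgroup device program. */
static struct dev_ctx make_ctx(uint32_t type, uint32_t access,
			       uint32_t major, uint32_t minor)
{
	struct dev_ctx ctx = { (access << 16) | type, major, minor };
	return ctx;
}

/* Return true if every requested access bit is covered by a matching rule. */
static bool dev_allowed(struct dev_ctx ctx)
{
	uint32_t type = ctx.access_type & 0xFFFF;
	uint32_t access = ctx.access_type >> 16;
	size_t i;

	for (i = 0; i < sizeof(allow_list) / sizeof(allow_list[0]); i++) {
		const struct dev_rule *r = &allow_list[i];

		if (r->type == type && r->major == ctx.major &&
		    r->minor == ctx.minor && (access & ~r->access) == 0)
			return true;
	}
	return false;
}
```

With this list, a mknod request for char device 1:3 is accepted, while
the same request for /dev/zero (char 1:5) or for a block device 1:3 is
rejected. A real cgroup device program would perform an equivalent check
on the context it receives from the kernel.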
Signed-off-by: Michael Weiß <michael.weiss@xxxxxxxxxxxxxxxxxxx>
---
Michael Weiß (4):
      bpf: add cgroup device guard to flag a cgroup device prog
      bpf: provide cgroup_device_guard in bpf_prog_info to user space
      device_cgroup: wrapper for bpf cgroup device guard
      fs: allow mknod in non-initial userns using cgroup device guard

 fs/namei.c                     | 19 ++++++++++++++++---
 include/linux/bpf-cgroup.h     |  7 +++++++
 include/linux/bpf.h            |  1 +
 include/linux/device_cgroup.h  |  7 +++++++
 include/uapi/linux/bpf.h       |  8 +++++++-
 kernel/bpf/cgroup.c            | 30 ++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |  6 +++++-
 security/device_cgroup.c       | 10 ++++++++++
 tools/bpf/bpftool/prog.c       |  2 ++
 tools/include/uapi/linux/bpf.h |  8 +++++++-
 10 files changed, 92 insertions(+), 6 deletions(-)
---
base-commit: 2ccdd1b13c591d306f0401d98dedc4bdcd02b421
change-id: 20230814-devcg_guard-5398ef84bf7b

Best regards,
-- 
Michael Weiß <michael.weiss@xxxxxxxxxxxxxxxxxxx>