On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote:
> > - Integrate this as LSM (Christian, Paul)
>
> Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
> thing...
>
> So I dislike this less than the initial version that just worked around

Hm, I wonder if we're being too timid or too complex in how we want to
solve this problem.

The device cgroup management logic is hacked into multiple layers and is
frankly pretty appalling.

What I think device access management wants to look like is that you can
implement a policy in an LSM - be it bpf or regular selinux - and have
this guarded by the main hooks:

    security_file_open()
    security_inode_mknod()

So, look at:

    vfs_get_tree()
    -> security_sb_set_mnt_opts()
       -> bpf_sb_set_mnt_opts()

A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags
today via bpf_sb_set_mnt_opts() without any kernel changes at all.

I assume that a bpf LSM can also keep state in sb->s_security just like
selinux et al? If so then a device access management program or whatever
can be stored in sb->s_security.

That device access management program would then be run on each call to:

    security_file_open()
    -> bpf_file_open()

and

    security_inode_mknod()
    -> bpf_inode_mknod()

and take access decisions.

This obviously makes device access management something that's tied
completely to a filesystem. So, you could have the same device node on
two tmpfs filesystems both mounted in the same userns.

The first tmpfs has SB_I_NODEV and doesn't allow you to open that
device. The second tmpfs has a bpf LSM program attached to it that has
stripped SB_I_NODEV and manages device access and allows callers to open
that device.

I guess it's even possible to restrict this on a caller basis by marking
them with a "container id" when the container is started. That can be
done with that task storage thing also via a bpf LSM hook.
And then you can further restrict device access to only those tasks that
have a specific container id in some range or some token or something.

I might just be fantasizing abilities into bpf that it doesn't have so
anyone with the knowledge please speak up.

If this is feasible then the only thing we need to figure out is what to
do with the legacy cgroup access management and specifically the
capable(CAP_SYS_ADMIN) check that's more of a hack than anything else.

So, we could introduce a sysctl that makes it possible to turn this
check into ns_capable(sb->s_userns, CAP_SYS_ADMIN). Due to SB_I_NODEV it
is inherently safe to do that. It's just that a lot of container
runtimes need to have time to adapt to a world where you may be able to
create a device but not be able to then open it. This isn't rocket
science but it will take time.

But in the end this will mean we get away with minimal kernel changes
and using a lot of existing infrastructure.

Thoughts?
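
To make the per-filesystem model above a bit more concrete, here is a
rough userspace sketch in plain C. It is not kernel or bpf code and it
doesn't claim to show how a bpf LSM would really be written. Every name
in it (sb_model, dev_policy, task_model, model_sb_set_mnt_opts(),
model_file_open(), MODEL_SB_I_NODEV) is made up for illustration, and so
are the choices to key the policy on a single major/minor pair and to
return -EACCES on denial. It only models the decision flow: a policy gets
attached to a superblock and strips the nodev flag, and the open hook
then decides based on the flag, the policy, and the caller's container
id.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bit standing in for SB_I_NODEV (hypothetical name). */
#define MODEL_SB_I_NODEV 0x4u

/* Hypothetical policy, standing in for state kept in sb->s_security. */
struct dev_policy {
	uint32_t allowed_major;
	uint32_t allowed_minor;
	uint64_t min_container_id;	/* inclusive */
	uint64_t max_container_id;	/* inclusive */
};

/* Minimal stand-in for a superblock. */
struct sb_model {
	unsigned int s_iflags;
	const struct dev_policy *s_security;	/* NULL: no policy attached */
};

/* Minimal stand-in for a task marked with a "container id". */
struct task_model {
	uint64_t container_id;
};

/* Models the mount-time hook: attach the policy, strip the nodev flag. */
static void model_sb_set_mnt_opts(struct sb_model *sb,
				  const struct dev_policy *policy)
{
	assert(sb != NULL && policy != NULL);
	sb->s_security = policy;
	sb->s_iflags &= ~MODEL_SB_I_NODEV;
}

/* Models the open hook for a device node: 0 allows, -EACCES denies. */
static int model_file_open(const struct sb_model *sb,
			   const struct task_model *task,
			   uint32_t major, uint32_t minor)
{
	const struct dev_policy *p;

	assert(sb != NULL && task != NULL);

	/* A nodev filesystem never allows opening device nodes. */
	if (sb->s_iflags & MODEL_SB_I_NODEV)
		return -EACCES;

	/* No policy attached: this sketch raises no objection. */
	p = sb->s_security;
	if (p == NULL)
		return 0;

	/* Only the one device named by the policy may be opened. */
	if (major != p->allowed_major || minor != p->allowed_minor)
		return -EACCES;

	/* Only callers whose container id is in range may open it. */
	if (task->container_id < p->min_container_id ||
	    task->container_id > p->max_container_id)
		return -EACCES;

	return 0;
}
```

With two sb_model instances that both start out with MODEL_SB_I_NODEV
set, calling model_sb_set_mnt_opts() on only one of them reproduces the
two-tmpfs scenario: the same device is refused on the first and allowed
on the second, and on the second only for tasks whose container id falls
into the configured range.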