On 24.11.23 19:08, Christian Brauner wrote:
> On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote:
>>> - Integrate this as LSM (Christian, Paul)
>>
>> Huh, my rant made you write an LSM. I'm not sure if that's a good or
>> bad thing...
>>
>> So I dislike this less than the initial version that just worked around
>> SB_I_NODEV and this might be able to go somewhere. _But_ I want to see
>> the rules written down:

Since we have some new ideas, I will also try to describe the vision and
the current device node interaction more precisely when submitting the
next version.

> Hm, I wonder if we're being too timid or too complex in how we want to
> solve this problem.
>
> The device cgroup management logic is hacked into multiple layers and
> is frankly pretty appalling.
>
> What I think device access management wants to look like is that you
> can implement a policy in an LSM - be it bpf or regular selinux - and
> have this guarded by the main hooks:
>
> security_file_open()
> security_inode_mknod()
>
> So, look at:
>
> vfs_get_tree()
> -> security_sb_set_mnt_opts()
>    -> bpf_sb_set_mnt_opts()
>
> A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags
> today via bpf_sb_set_mnt_opts() without any kernel changes at all.
>
> I assume that a bpf LSM can also keep state in sb->s_security just like
> selinux et al? If so then a device access management program or
> whatever can be stored in sb->s_security.
>
> That device access management program would then be run on each call to:
>
> security_file_open()
> -> bpf_file_open()
>
> and
>
> security_inode_mknod()
> -> bpf_inode_mknod()
>
> and take access decisions.
>
> This obviously makes device access management something that's tied
> completely to a filesystem. So, you could have the same device node on
> two tmpfs filesystems both mounted in the same userns.
>
> The first tmpfs has SB_I_NODEV and doesn't allow you to open that
> device. The second tmpfs has a bpf LSM program attached to it that has
> stripped SB_I_NODEV and manages device access and allows callers to
> open that device.

I like the approach of clearing SB_I_NODEV in security_sb_set_mnt_opts().
I still have to sort this out, but I think that was the missing piece in
the current patch set.

> I guess it's even possible to restrict this on a caller basis by
> marking them with a "container id" when the container is started. That
> can be done with that task storage thing also via a bpf LSM hook. And
> then you can further restrict device access to only those tasks that
> have a specific container id in some range or some token or something.
>
> I might just be fantasizing abilities into bpf that it doesn't have so
> anyone with the knowledge please speak up.
>
> If this is feasible then the only thing we need to figure out is what
> to do with the legacy cgroup access management and specifically the
> capable(CAP_SYS_ADMIN) check that's more of a hack than anything else.

First, to make this clear: we are speaking about CAP_MKNOD here.

My approach was to restructure the device cgroup logic into its own
cgroup_device LSM, not into the current bpf LSM, to also be able to
handle the legacy calls. In particular, the remaining direct calls to
devcgroup_check_permission() would be transformed into a new security
hook security_dev_permission(), which is similar to
security_inode_permission() but can be called in places where only the
dev_t representation of the device is available.
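To make this concrete, here is roughly what I have in mind for the new
hook. The signature just mirrors the existing devcgroup_check_permission()
arguments; all names are only a sketch, nothing here is final:

/* include/linux/security.h (sketch) */
#ifdef CONFIG_SECURITY
int security_dev_permission(short type, u32 major, u32 minor, short access);
#else
static inline int security_dev_permission(short type, u32 major, u32 minor,
					  short access)
{
	return 0;
}
#endif

/* security/security.c (sketch), plus a matching LSM_HOOK() entry in
 * include/linux/lsm_hook_defs.h:
 */
int security_dev_permission(short type, u32 major, u32 minor, short access)
{
	return call_int_hook(dev_permission, 0, type, major, minor, access);
}

The places that today call devcgroup_check_permission() directly, e.g. in
blk and the amd/gpu driver, would then do something like:

	err = security_dev_permission(DEVCG_DEV_CHAR, MAJOR(dev), MINOR(dev),
				      DEVCG_ACC_READ | DEVCG_ACC_WRITE);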
However, if we implement it the way you sketched above, we can just leave
devcgroup_check_permission() in its current place, only implement
devcgroup_inode_permission() and devcgroup_inode_mknod(), and leave the
blk and amd/gpu drivers as they are for now - or just leave all the
devcgroup_*() hooks as they are.

> So, we could introduce a sysctl that makes it possible to turn this
> check into ns_capable(sb->s_user_ns, CAP_SYS_ADMIN). Because due to
> SB_I_NODEV it is inherently safe to do that. It's just that a lot of
> container runtimes need to have time to adapt to a world where you may
> be able to create a device but not be able to then open it. This isn't
> rocket science but it will take time.

True. I think a sysctl would be a good option.

> But in the end this will mean we get away with minimal kernel changes
> and using a lot of existing infrastructure.
>
> Thoughts?

For the whole bpf LSM part I'm not confident enough to make any
proposition yet. But I think a simple devnode LSM of its own, with the
above described security_sb_set_mnt_opts() handling, would be feasible to
get this idea realized (a rough sketch follows at the end of this mail).

Maybe we go that way: implement a simple LSM without any changes to the
device_cgroup and keep the devcgroup hooks in place. Implementing it as a
bpf LSM with a full-blown container id could then be done orthogonally.

So from a simple container runtime perspective, one could just use the
simple LSM and the existing bpf device cgroup program without any change,
only having to activate the sysctl. A more complex cloud setup -
Kubernetes or whatever - could then use the bpf LSM approach.
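To illustrate, the core of such a minimal devnode LSM could boil down to
something like the following. This is a completely untested sketch; the
LSM name, the sysctl knob and its wiring are made up, and the actual
per-device policy in file_open is left out:

/* security/devnode/devnode.c (untested sketch) */
#include <linux/capability.h>
#include <linux/fs.h>
#include <linux/lsm_hooks.h>

/* hypothetical sysctl knob, wiring omitted */
static int devnode_strip_nodev;

static int devnode_sb_set_mnt_opts(struct super_block *sb, void *mnt_opts,
				   unsigned long kern_flags,
				   unsigned long *set_kern_flags)
{
	/*
	 * Strip SB_I_NODEV only if the sysctl is enabled and the mounter
	 * is privileged in the sb's user namespace; the per-device access
	 * decisions would still be enforced via the file_open hook.
	 */
	if (devnode_strip_nodev && ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
		sb->s_iflags &= ~SB_I_NODEV;

	return 0;
}

static struct security_hook_list devnode_hooks[] __ro_after_init = {
	LSM_HOOK_INIT(sb_set_mnt_opts, devnode_sb_set_mnt_opts),
};

static int __init devnode_init(void)
{
	security_add_hooks(devnode_hooks, ARRAY_SIZE(devnode_hooks),
			   "devnode");
	return 0;
}

DEFINE_LSM(devnode) = {
	.name = "devnode",
	.init = devnode_init,
};

With something like that in place, the existing bpf device cgroup
programs would keep working unchanged, and the only opt-in needed from a
container runtime would be activating the sysctl.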