Re: [RESEND RFC PATCH v2 00/14] device_cgroup: guard mknod for non-initial user namespace

Christian Brauner <brauner@xxxxxxxxxx> · Fri, 24 Nov 2023 19:08:50 +0100

On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote:
> > - Integrate this as LSM (Christian, Paul)
> 
> Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
> thing...
> 
> So I dislike this less than the initial version that just worked around

Hm, I wonder if we're being to timid or too complex in how we want to
solve this problem.

The device cgroup management logic is hacked into multiple layers and is
frankly pretty appalling.

What I think device access management wants to look like is that you can
implement a policy in an LSM - be it bpf or regular selinux - and have
this guarded by the main hooks:

security_file_open()
security_inode_mknod()

So, look at:

vfs_get_tree()
-> security_sb_set_mnt_opts()
   -> bpf_sb_set_mnt_opts()

A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags
today via bpf_sb_set_mnt_opts() without any kernel changes at all.

I assume that a bpf LSM can also keep state in sb->s_security just like
selinux et al? If so then a device access management program or whatever
can be stored in sb->s_security.

That device access management program would then be run on each call to:

security_file_open()
-> bpf_file_open()

and

security_inode_mknod()
-> bpf_sb_set_mnt_opts()

and take access decisions.

This obviously makes device access management something that's tied
completely to a filesystem. So, you could have the same device node on
two tmpfs filesystems both mounted in the same userns.

The first tmpfs has SB_I_NODEV and doesn't allow you to open that
device. The second tmpfs has a bpf LSM program attached to it that has
stripped SB_I_NODEV and manages device access and allows callers to open
that device.

I guess it's even possible to restrict this on a caller basis by marking
them with a "container id" when the container is started. That can be
done with that task storage thing also via a bpf LSM hook. And then
you can further restrict device access to only those tasks that have a
specific container id in some range or some token or something.

I might just be fantasizing abilities into bpf that it doesn't have so
anyone with the knowledge please speak up.

If this is feasible then the only thing we need to figure out is what to
do with the legacy cgroup access management and specifically the
capable(CAP_SYS_ADMIN) check that's more of a hack than anything else.

So, we could introduce a sysctl that makes it possible to turn this
check into ns_capable(sb->s_userns, CAP_SYS_ADMIN). Because due to
SB_I_NODEV it is inherently safe to do that. It's just that a lot of
container runtimes need to have time to adapt to a world where you may
be able to create a device but not be able to then open it. This isn't
rocket science but it will take time.

But in the end this will mean we get away with minimal kernel changes
and using a lot of existing infrastructure.

Thoughts?