Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Thu, 17 Aug 2023 15:11:02 -0700




On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote:
> On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote:
> > Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
> > which allows to set a cgroup device program to be a device guard.
> 
> Currently we block access to devices unconditionally in may_open_dev().
> Anything that's mounted by an unprivileged containers will get
> SB_I_NODEV set in s_i_flags.
> 
> Then we currently mediate device access in:
> 
> * inode_permission()
>   -> devcgroup_inode_permission()
> * vfs_mknod()
>   -> devcgroup_inode_mknod()
> * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
>   -> devcgroup_check_permission()
> * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
>   -> devcgroup_check_permission()
> 
> All your new flag does is to bypass that SB_I_NODEV check afaict and let
> it proceed to the devcgroup_*() checks for the vfs layer.
> 
> But I don't get the semantics yet.
> Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or
> is that a flag on random bpf programs? It looks like it would be the
> latter but design-wise I would expect this to be a property of the
> device program itself.

Looks like patch 4 attemps to bypass usual permission checks with:
@@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
        if (error)
                return error;

-       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
-           !capable(CAP_MKNOD))
-               return -EPERM;
+       /*
+        * In case of a device cgroup restirction allow mknod in user
+        * namespace. Otherwise just check global capability; thus,
+        * mknod is also disabled for user namespace other than the
+        * initial one.
+        */
+       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
+               if (devcgroup_task_is_guarded(current)) {
+                       if (!ns_capable(current_user_ns(), CAP_MKNOD))
+                               return -EPERM;
+               } else if (!capable(CAP_MKNOD))
+                       return -EPERM;
+       }

which pretty much sounds like authoritative LSM that was brought up in the past
and LSM folks didn't like it.

If vfs folks are ok with this special bypass of permissions in vfs_mknod()
we can talk about kernel->bpf api details.
The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go,
but no point going into bpf details now until agreement on bypass is made.