On Fri, Aug 18, 2023 at 12:11 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote: > > On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote: > > > Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD > > > which allows to set a cgroup device program to be a device guard. > > > > Currently we block access to devices unconditionally in may_open_dev(). > > Anything that's mounted by an unprivileged containers will get > > SB_I_NODEV set in s_i_flags. > > > > Then we currently mediate device access in: > > > > * inode_permission() > > -> devcgroup_inode_permission() > > * vfs_mknod() > > -> devcgroup_inode_mknod() > > * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends > > -> devcgroup_check_permission() > > * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict > > -> devcgroup_check_permission() > > > > All your new flag does is to bypass that SB_I_NODEV check afaict and let > > it proceed to the devcgroup_*() checks for the vfs layer. > > > > But I don't get the semantics yet. > > Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or > > is that a flag on random bpf programs? It looks like it would be the > > latter but design-wise I would expect this to be a property of the > > device program itself. > > Looks like patch 4 attemps to bypass usual permission checks with: > @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir, > if (error) > return error; > > - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout && > - !capable(CAP_MKNOD)) > - return -EPERM; > + /* > + * In case of a device cgroup restirction allow mknod in user > + * namespace. Otherwise just check global capability; thus, > + * mknod is also disabled for user namespace other than the > + * initial one. > + */ > + if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) { > + if (devcgroup_task_is_guarded(current)) { > + if (!ns_capable(current_user_ns(), CAP_MKNOD)) > + return -EPERM; > + } else if (!capable(CAP_MKNOD)) > + return -EPERM; > + } > Dear colleagues, > which pretty much sounds like authoritative LSM that was brought up in the past > and LSM folks didn't like it. Thanks for pointing this out, Alexei! I've searched through the LKML archives and found a thread about this: https://lore.kernel.org/all/CAEf4BzaBt0W3sWh_L4RRXEFYdBotzVEnQdqC7BO+PNWtD7eSUA@xxxxxxxxxxxxxx/ As far as I understand, disagreement here is about a practice of skipping kernel-built capability checks based on LSM hooks, right? +CC Paul Moore <paul@xxxxxxxxxxxxxx> > > If vfs folks are ok with this special bypass of permissions in vfs_mknod() > we can talk about kernel->bpf api details. > The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go, > but no point going into bpf details now until agreement on bypass is made. +CC Miklos Szeredi <miklos@xxxxxxxxxx> +CC Amir Goldstein <amir73il@xxxxxxxxx> Kind regards, Alex