On Tue, Aug 29, 2023 at 03:35:46PM +0200, Alexander Mikhalitsyn wrote: > On Fri, Aug 18, 2023 at 12:11 AM Alexei Starovoitov > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote: > > > On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote: > > > > Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD > > > > which allows to set a cgroup device program to be a device guard. > > > > > > Currently we block access to devices unconditionally in may_open_dev(). > > > Anything that's mounted by an unprivileged containers will get > > > SB_I_NODEV set in s_i_flags. > > > > > > Then we currently mediate device access in: > > > > > > * inode_permission() > > > -> devcgroup_inode_permission() > > > * vfs_mknod() > > > -> devcgroup_inode_mknod() > > > * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends > > > -> devcgroup_check_permission() > > > * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict > > > -> devcgroup_check_permission() > > > > > > All your new flag does is to bypass that SB_I_NODEV check afaict and let > > > it proceed to the devcgroup_*() checks for the vfs layer. > > > > > > But I don't get the semantics yet. > > > Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or > > > is that a flag on random bpf programs? It looks like it would be the > > > latter but design-wise I would expect this to be a property of the > > > device program itself. > > > > Looks like patch 4 attemps to bypass usual permission checks with: > > @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir, > > if (error) > > return error; > > > > - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout && > > - !capable(CAP_MKNOD)) > > - return -EPERM; > > + /* > > + * In case of a device cgroup restirction allow mknod in user > > + * namespace. Otherwise just check global capability; thus, > > + * mknod is also disabled for user namespace other than the > > + * initial one. > > + */ > > + if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) { > > + if (devcgroup_task_is_guarded(current)) { > > + if (!ns_capable(current_user_ns(), CAP_MKNOD)) > > + return -EPERM; > > + } else if (!capable(CAP_MKNOD)) > > + return -EPERM; > > + } > > > > Dear colleagues, > > > which pretty much sounds like authoritative LSM that was brought up in the past > > and LSM folks didn't like it. > > Thanks for pointing this out, Alexei! > I've searched through the LKML archives and found a thread about this: > https://lore.kernel.org/all/CAEf4BzaBt0W3sWh_L4RRXEFYdBotzVEnQdqC7BO+PNWtD7eSUA@xxxxxxxxxxxxxx/ > > As far as I understand, disagreement here is about a practice of > skipping kernel-built capability checks based > on LSM hooks, right? > > +CC Paul Moore <paul@xxxxxxxxxxxxxx> > > > > > If vfs folks are ok with this special bypass of permissions in vfs_mknod() > > we can talk about kernel->bpf api details. > > The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go, > > but no point going into bpf details now until agreement on bypass is made. Afaiu the original concern was specifically about an LSM allowing to bypass other LSMs or DAC permissions. But this wouldn't be the case here. The general inode access LSM permission mediation is separate from specific device access management: the security_inode_permission() LSM hook would still be called and thus LSMs restrictions would continue to apply exactly as they do now. For cgroup v1 device access management was a cgroup controller with management interface through files. It then was ported to an eBPF program attachable to cgroups for cgroup v2. Arguably, it should probably have been ported to an LSM hook or a separate LSM and untied from cgroups completely. The confusion here seems to indicate that that would have been the right way to go. Because right now device access management seems its own form of mandatory access control.