On 04.09.23 13:44, Christian Brauner wrote: > On Tue, Aug 29, 2023 at 03:35:46PM +0200, Alexander Mikhalitsyn wrote: >> On Fri, Aug 18, 2023 at 12:11 AM Alexei Starovoitov >> <alexei.starovoitov@xxxxxxxxx> wrote: >>> >>> On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote: >>>> On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote: >>>>> Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD >>>>> which allows to set a cgroup device program to be a device guard. >>>> >>>> Currently we block access to devices unconditionally in may_open_dev(). >>>> Anything that's mounted by an unprivileged containers will get >>>> SB_I_NODEV set in s_i_flags. >>>> >>>> Then we currently mediate device access in: >>>> >>>> * inode_permission() >>>> -> devcgroup_inode_permission() >>>> * vfs_mknod() >>>> -> devcgroup_inode_mknod() >>>> * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends >>>> -> devcgroup_check_permission() >>>> * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict >>>> -> devcgroup_check_permission() >>>> >>>> All your new flag does is to bypass that SB_I_NODEV check afaict and let >>>> it proceed to the devcgroup_*() checks for the vfs layer. >>>> >>>> But I don't get the semantics yet. >>>> Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or >>>> is that a flag on random bpf programs? It looks like it would be the >>>> latter but design-wise I would expect this to be a property of the >>>> device program itself. >>> >>> Looks like patch 4 attemps to bypass usual permission checks with: >>> @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir, >>> if (error) >>> return error; >>> >>> - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout && >>> - !capable(CAP_MKNOD)) >>> - return -EPERM; >>> + /* >>> + * In case of a device cgroup restirction allow mknod in user >>> + * namespace. Otherwise just check global capability; thus, >>> + * mknod is also disabled for user namespace other than the >>> + * initial one. >>> + */ >>> + if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) { >>> + if (devcgroup_task_is_guarded(current)) { >>> + if (!ns_capable(current_user_ns(), CAP_MKNOD)) >>> + return -EPERM; >>> + } else if (!capable(CAP_MKNOD)) >>> + return -EPERM; >>> + } >>> >> >> Dear colleagues, >> >>> which pretty much sounds like authoritative LSM that was brought up in the past >>> and LSM folks didn't like it. >> >> Thanks for pointing this out, Alexei! >> I've searched through the LKML archives and found a thread about this: >> https://lore.kernel.org/all/CAEf4BzaBt0W3sWh_L4RRXEFYdBotzVEnQdqC7BO+PNWtD7eSUA@xxxxxxxxxxxxxx/ >> >> As far as I understand, disagreement here is about a practice of >> skipping kernel-built capability checks based >> on LSM hooks, right? >> >> +CC Paul Moore <paul@xxxxxxxxxxxxxx> >> >>> >>> If vfs folks are ok with this special bypass of permissions in vfs_mknod() >>> we can talk about kernel->bpf api details. >>> The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go, >>> but no point going into bpf details now until agreement on bypass is made. > > Afaiu the original concern was specifically about an LSM allowing to > bypass other LSMs or DAC permissions. But this wouldn't be the case > here. The general inode access LSM permission mediation is separate from > specific device access management: the security_inode_permission() LSM > hook would still be called and thus LSMs restrictions would continue to > apply exactly as they do now. So are OK with the checks here? > > For cgroup v1 device access management was a cgroup controller with > management interface through files. It then was ported to an eBPF > program attachable to cgroups for cgroup v2. Arguably, it should > probably have been ported to an LSM hook or a separate LSM and untied > from cgroups completely. The confusion here seems to indicate that that > would have been the right way to go. > > Because right now device access management seems its own form of > mandatory access control. I'm currently testing an updated version which has incorporated the locking changes already mention by Alex and the change which avoids setting SB_I_NODEV in fs/super.c. Does is make sense to send out a v2 to further discuss BPF related changes? Thnx, Michael