On Mon, Sep 4, 2023 at 7:44 AM Christian Brauner <brauner@xxxxxxxxxx> wrote: > On Tue, Aug 29, 2023 at 03:35:46PM +0200, Alexander Mikhalitsyn wrote: > > On Fri, Aug 18, 2023 at 12:11 AM Alexei Starovoitov > > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > > > On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote: > > > > On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote: > > > > > Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD > > > > > which allows to set a cgroup device program to be a device guard. > > > > > > > > Currently we block access to devices unconditionally in may_open_dev(). > > > > Anything that's mounted by an unprivileged containers will get > > > > SB_I_NODEV set in s_i_flags. > > > > > > > > Then we currently mediate device access in: > > > > > > > > * inode_permission() > > > > -> devcgroup_inode_permission() > > > > * vfs_mknod() > > > > -> devcgroup_inode_mknod() > > > > * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends > > > > -> devcgroup_check_permission() > > > > * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict > > > > -> devcgroup_check_permission() > > > > > > > > All your new flag does is to bypass that SB_I_NODEV check afaict and let > > > > it proceed to the devcgroup_*() checks for the vfs layer. > > > > > > > > But I don't get the semantics yet. > > > > Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or > > > > is that a flag on random bpf programs? It looks like it would be the > > > > latter but design-wise I would expect this to be a property of the > > > > device program itself. > > > > > > Looks like patch 4 attemps to bypass usual permission checks with: > > > @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir, > > > if (error) > > > return error; > > > > > > - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout && > > > - !capable(CAP_MKNOD)) > > > - return -EPERM; > > > + /* > > > + * In case of a device cgroup restirction allow mknod in user > > > + * namespace. Otherwise just check global capability; thus, > > > + * mknod is also disabled for user namespace other than the > > > + * initial one. > > > + */ > > > + if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) { > > > + if (devcgroup_task_is_guarded(current)) { > > > + if (!ns_capable(current_user_ns(), CAP_MKNOD)) > > > + return -EPERM; > > > + } else if (!capable(CAP_MKNOD)) > > > + return -EPERM; > > > + } > > > > > > > Dear colleagues, > > > > > which pretty much sounds like authoritative LSM that was brought up in the past > > > and LSM folks didn't like it. > > > > Thanks for pointing this out, Alexei! > > I've searched through the LKML archives and found a thread about this: > > https://lore.kernel.org/all/CAEf4BzaBt0W3sWh_L4RRXEFYdBotzVEnQdqC7BO+PNWtD7eSUA@xxxxxxxxxxxxxx/ > > > > As far as I understand, disagreement here is about a practice of > > skipping kernel-built capability checks based > > on LSM hooks, right? > > > > +CC Paul Moore <paul@xxxxxxxxxxxxxx> > > > > > > > > If vfs folks are ok with this special bypass of permissions in vfs_mknod() > > > we can talk about kernel->bpf api details. > > > The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go, > > > but no point going into bpf details now until agreement on bypass is made. > > Afaiu the original concern was specifically about an LSM allowing to > bypass other LSMs or DAC permissions. But this wouldn't be the case > here. The general inode access LSM permission mediation is separate from > specific device access management: the security_inode_permission() LSM > hook would still be called and thus LSMs restrictions would continue to > apply exactly as they do now. > > For cgroup v1 device access management was a cgroup controller with > management interface through files. It then was ported to an eBPF > program attachable to cgroups for cgroup v2. Arguably, it should > probably have been ported to an LSM hook or a separate LSM and untied > from cgroups completely. The confusion here seems to indicate that that > would have been the right way to go. > > Because right now device access management seems its own form of > mandatory access control. My apologies, I lost this thread in my inbox and I'm just reading it now. Historically we haven't any real LSM controls around cgroups/resource-management, but that is mostly because everything that I recall being proposed has been very piecemeal and didn't provide a comprehensive access control solution for resource management. If someone wanted to propose a proper set of access control hooks for cgroups I think that is something we would be happy to review, merge, etc. Somewhat relatedly, we've been working on some docs around guidelines for new LSM hooks; eventually the guidelines will work their way into the kernel docs, but here they are now: * https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsm-hook-guidelines -- paul-moore.com