Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog

Michael Weiß <michael.weiss@xxxxxxxxxxxxxxxxxxx> · Mon, 11 Sep 2023 12:38:37 +0200

On 04.09.23 13:44, Christian Brauner wrote:
> On Tue, Aug 29, 2023 at 03:35:46PM +0200, Alexander Mikhalitsyn wrote:
>> On Fri, Aug 18, 2023 at 12:11 AM Alexei Starovoitov
>> <alexei.starovoitov@xxxxxxxxx> wrote:
>>>
>>> On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote:
>>>> On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote:
>>>>> Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
>>>>> which allows to set a cgroup device program to be a device guard.
>>>>
>>>> Currently we block access to devices unconditionally in may_open_dev().
>>>> Anything that's mounted by an unprivileged containers will get
>>>> SB_I_NODEV set in s_i_flags.
>>>>
>>>> Then we currently mediate device access in:
>>>>
>>>> * inode_permission()
>>>>   -> devcgroup_inode_permission()
>>>> * vfs_mknod()
>>>>   -> devcgroup_inode_mknod()
>>>> * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
>>>>   -> devcgroup_check_permission()
>>>> * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
>>>>   -> devcgroup_check_permission()
>>>>
>>>> All your new flag does is to bypass that SB_I_NODEV check afaict and let
>>>> it proceed to the devcgroup_*() checks for the vfs layer.
>>>>
>>>> But I don't get the semantics yet.
>>>> Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or
>>>> is that a flag on random bpf programs? It looks like it would be the
>>>> latter but design-wise I would expect this to be a property of the
>>>> device program itself.
>>>
>>> Looks like patch 4 attemps to bypass usual permission checks with:
>>> @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
>>>         if (error)
>>>                 return error;
>>>
>>> -       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
>>> -           !capable(CAP_MKNOD))
>>> -               return -EPERM;
>>> +       /*
>>> +        * In case of a device cgroup restirction allow mknod in user
>>> +        * namespace. Otherwise just check global capability; thus,
>>> +        * mknod is also disabled for user namespace other than the
>>> +        * initial one.
>>> +        */
>>> +       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
>>> +               if (devcgroup_task_is_guarded(current)) {
>>> +                       if (!ns_capable(current_user_ns(), CAP_MKNOD))
>>> +                               return -EPERM;
>>> +               } else if (!capable(CAP_MKNOD))
>>> +                       return -EPERM;
>>> +       }
>>>
>>
>> Dear colleagues,
>>
>>> which pretty much sounds like authoritative LSM that was brought up in the past
>>> and LSM folks didn't like it.
>>
>> Thanks for pointing this out, Alexei!
>> I've searched through the LKML archives and found a thread about this:
>> https://lore.kernel.org/all/CAEf4BzaBt0W3sWh_L4RRXEFYdBotzVEnQdqC7BO+PNWtD7eSUA@xxxxxxxxxxxxxx/
>>
>> As far as I understand, disagreement here is about a practice of
>> skipping kernel-built capability checks based
>> on LSM hooks, right?
>>
>> +CC Paul Moore <paul@xxxxxxxxxxxxxx>
>>
>>>
>>> If vfs folks are ok with this special bypass of permissions in vfs_mknod()
>>> we can talk about kernel->bpf api details.
>>> The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go,
>>> but no point going into bpf details now until agreement on bypass is made.
> 
> Afaiu the original concern was specifically about an LSM allowing to
> bypass other LSMs or DAC permissions. But this wouldn't be the case
> here. The general inode access LSM permission mediation is separate from
> specific device access management: the security_inode_permission() LSM
> hook would still be called and thus LSMs restrictions would continue to
> apply exactly as they do now.

So are OK with the checks here?
> 
> For cgroup v1 device access management was a cgroup controller with
> management interface through files. It then was ported to an eBPF
> program attachable to cgroups for cgroup v2. Arguably, it should
> probably have been ported to an LSM hook or a separate LSM and untied
> from cgroups completely. The confusion here seems to indicate that that
> would have been the right way to go.
> 
> Because right now device access management seems its own form of
> mandatory access control.

I'm currently testing an updated version which has incorporated the locking
changes already mention by Alex and the change which avoids setting SB_I_NODEV
in fs/super.c.

Does is make sense to send out a v2 to further discuss BPF related changes?

Thnx,
Michael