Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object

Christian Brauner <brauner@xxxxxxxxxx> · Wed, 5 Jul 2023 10:45:50 +0200

On Wed, Jul 05, 2023 at 09:20:28AM +0200, Daniel Borkmann wrote:
> On 7/5/23 1:28 AM, Toke Høiland-Jørgensen wrote:
> > Christian Brauner <brauner@xxxxxxxxxx> writes:
> > > On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
> > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to
> > > > allow delegating privileged BPF functionality, like loading a BPF
> > > > program or creating a BPF map, from privileged process to a *trusted*
> > > > unprivileged process, all while having a good amount of control over which
> > > > privileged operations could be performed using provided BPF token.
> > > > 
> > > > This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
> > > > allows to create a new BPF token object along with a set of allowed
> > > > commands that such BPF token allows to unprivileged applications.
> > > > Currently only BPF_TOKEN_CREATE command itself can be
> > > > delegated, but other patches gradually add ability to delegate
> > > > BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
> > > > 
> > > > The above means that new BPF tokens can be created using existing BPF
> > > > token, if original privileged creator allowed BPF_TOKEN_CREATE command.
> > > > New derived BPF token cannot be more powerful than the original BPF
> > > > token.
> > > > 
> > > > Importantly, BPF token is automatically pinned at the specified location
> > > > inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
> > > > command, unlike BPF prog/map/btf/link. This provides more control over
> > > > unintended sharing of BPF tokens through pinning them in other BPF FS
> > > > instances.
> > > > 
> > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > > ---
> > > 
> > > The main issue I have with the token approach is that it is a completely
> > > separate delegation vector on top of user namespaces. We mentioned this
> > > during the conf and this was brought up on the thread here again as well.
> > > Imho, that's a problem both security-wise and complexity-wise.
> > > 
> > > It's not great if each subsystem gets its own custom delegation
> > > mechanism. This imposes such a taxing complexity on both kernel- and
> > > userspace that it will quickly become a huge liability. So I would
> > > really strongly encourage you to explore another direction.
> > 
> > I share this concern as well, but I'm not quite sure I follow your
> > proposal here. IIUC, you're saying that instead of creating the token
> > using a BPF_TOKEN_CREATE command, the policy daemon should create a
> > bpffs instance and attach the token value directly to that, right? But
> > then what? Are you proposing that the calling process inside the
> > container open a filesystem reference (how? using fspick()?) and pass
> > that to the bpf syscall? Or is there some way to find the right
> > filesystem instance to extract this from at the time that the bpf()
> > syscall is issued inside the container?
> 
> Given there can be multiple bpffs instances, it would have to be similar
> to what Andrii did in that you need to pass the fd to the bpf(2) for
> prog/map creation in order to retrieve the opts->abilities from the super
> block.

I think it's pretty flexible what one can do here. Off the top of my
head there could be a dedicated file like /sys/fs/bpf/delegate which
only exists if delegation has been enabled. Though that might be just a
wasted inode. There could be a new ioctl() on bpffs which has the same
effect.

Probably an ioctl() on the bpffs instance is easier to grok. You could
even take away rights granted by a bpffs instance from such an fd via
additional ioctl() on it.

For increased limitations, it's also possible to have an optional
write-time security check from within the bpf call itself, e.g.,

    sys_bpf(fd_delegate)
    {
            struct fd fd = fdget_raw(fd_delegate);
            int err;

            if (!fd.file)
                    return -EBADF;

            /* That token is only valid within a single user namespace ... */
            err = -EINVAL;
            if (fd.file->f_cred->user_ns != current_user_ns())
                    goto out;

            /* woah, no CAP_BPF? */
            err = -EPERM;
            if (!ns_capable(fd.file->f_cred->user_ns, CAP_BPF))
                    goto out;

            /* now check abilities */

            err = 0;
    out:
            fdput(fd);
            return err;
    }

I'm not claiming that this is the silver bullet but it fits within the
framework of this approach and explicitly ties it into bpffs right from
the get go since this is the delegation mechanism's core.
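
The order of the checks in that pseudocode can be exercised outside the
kernel with a small userspace model. Everything below is an assumption
made for illustration: the model_* types stand in for kernel state, and
the final "abilities" test is shown as a plain per-command bitmask, which
the pseudocode above leaves unspecified.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for kernel state; every name here is hypothetical. */
struct model_userns { int id; };

struct model_delegate {
        const struct model_userns *owner_ns;    /* userns of the fd's opener */
        uint64_t abilities;                     /* bit N set => command N delegated */
};

struct model_task {
        const struct model_userns *ns;  /* caller's current userns */
        bool cap_bpf;                   /* caller holds CAP_BPF in that userns */
};

static int delegate_check(const struct model_delegate *d,
                          const struct model_task *t, unsigned int cmd)
{
        assert(cmd < 64);

        /* The delegation is only valid within a single user namespace. */
        if (d->owner_ns != t->ns)
                return -EINVAL;

        /* The caller still needs CAP_BPF in that namespace. */
        if (!t->cap_bpf)
                return -EPERM;

        /* Finally, the requested command must have been delegated. */
        if (!(d->abilities & (UINT64_C(1) << cmd)))
                return -EPERM;

        return 0;
}
```

With a delegate that grants only command 5, a caller in the same
namespace holding CAP_BPF passes for command 5 and is refused for
command 6; a caller from a different namespace gets -EINVAL before the
capability is even looked at.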

The systemd-bpfd approach that was once pushed could probably also work
and I'm not up to date on why this was rejected. The issue against
systemd is still open.