Re: [PATCH v6 3/13] bpf: introduce BPF token object

Paul Moore <paul@xxxxxxxxxxxxxx> · Thu, 12 Oct 2023 19:18:50 -0400

On Wed, Oct 11, 2023 at 8:31 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
> On Tue, Oct 10, 2023 at 6:17 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> > On Sep 27, 2023 Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > >
> > > Add new kind of BPF kernel object, BPF token. BPF token is meant to
> > > allow delegating privileged BPF functionality, like loading a BPF
> > > program or creating a BPF map, from privileged process to a *trusted*
> > > unprivileged process, all while have a good amount of control over which
> > > privileged operations could be performed using provided BPF token.
> > >
> > > This is achieved through mounting BPF FS instance with extra delegation
> > > mount options, which determine what operations are delegatable, and also
> > > constraining it to the owning user namespace (as mentioned in the
> > > previous patch).
> > >
> > > BPF token itself is just a derivative from BPF FS and can be created
> > > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts
> > > a path specification (using the usual fd + string path combo) to a BPF
> > > FS mount. Currently, BPF token "inherits" delegated command, map types,
> > > prog type, and attach type bit sets from BPF FS as is. In the future,
> > > having an BPF token as a separate object with its own FD, we can allow
> > > to further restrict BPF token's allowable set of things either at the creation
> > > time or after the fact, allowing the process to guard itself further
> > > from, e.g., unintentionally trying to load undesired kind of BPF
> > > programs. But for now we keep things simple and just copy bit sets as is.
> > >
> > > When BPF token is created from BPF FS mount, we take reference to the
> > > BPF super block's owning user namespace, and then use that namespace for
> > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
> > > capabilities that are normally only checked against init userns (using
> > > capable()), but now we check them using ns_capable() instead (if BPF
> > > token is provided). See bpf_token_capable() for details.
> > >
> > > Such setup means that BPF token in itself is not sufficient to grant BPF
> > > functionality. User namespaced process has to *also* have necessary
> > > combination of capabilities inside that user namespace. So while
> > > previously CAP_BPF was useless when granted within user namespace, now
> > > it gains a meaning and allows container managers and sys admins to have
> > > a flexible control over which processes can and need to use BPF
> > > functionality within the user namespace (i.e., container in practice).
> > > And BPF FS delegation mount options and derived BPF tokens serve as
> > > a per-container "flag" to grant overall ability to use bpf() (plus further
> > > restrict on which parts of bpf() syscalls are treated as namespaced).
> > >
> > > The alternative to creating BPF token object was:
> > >   a) not having any extra object and just pasing BPF FS path to each
> > >      relevant bpf() command. This seems suboptimal as it's racy (mount
> > >      under the same path might change in between checking it and using it
> > >      for bpf() command). And also less flexible if we'd like to further
> > >      restrict ourselves compared to all the delegated functionality
> > >      allowed on BPF FS.
> > >   b) use non-bpf() interface, e.g., ioctl(), but otherwise also create
> > >      a dedicated FD that would represent a token-like functionality. This
> > >      doesn't seem superior to having a proper bpf() command, so
> > >      BPF_TOKEN_CREATE was chosen.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > ---
> > >  include/linux/bpf.h            |  40 +++++++
> > >  include/uapi/linux/bpf.h       |  39 +++++++
> > >  kernel/bpf/Makefile            |   2 +-
> > >  kernel/bpf/inode.c             |  10 +-
> > >  kernel/bpf/syscall.c           |  17 +++
> > >  kernel/bpf/token.c             | 197 +++++++++++++++++++++++++++++++++
> > >  tools/include/uapi/linux/bpf.h |  39 +++++++
> > >  7 files changed, 339 insertions(+), 5 deletions(-)
> > >  create mode 100644 kernel/bpf/token.c
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index a5bd40f71fd0..c43131a24579 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1572,6 +1576,13 @@ struct bpf_mount_opts {
> > >       u64 delegate_attachs;
> > >  };
> > >
> > > +struct bpf_token {
> > > +     struct work_struct work;
> > > +     atomic64_t refcnt;
> > > +     struct user_namespace *userns;
> > > +     u64 allowed_cmds;
> >
> > We'll also need a 'void *security' field to go along with the BPF token
> > allocation/creation/free hooks, see my comments below.  This is similar
> > to what we do for other kernel objects.
>
> ok, I'm thinking of adding a dedicated patch for all the
> security-related stuff and refactoring of existing LSM hook(s).

No objection here.  My main concern is that we get the LSM stuff in
the same patchset as the rest of the BPF token patches; if all of the
LSM bits are in a separate patch I'm not bothered.

Once we settle on the LSM hooks I'll draft a SELinux implementation of
the hooks which I'll hand off to you for inclusion in the patchset as
well.  I'd encourage the other LSMs that are interested to do the
same.

> > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > +{
> > > +     /* BPF token allows ns_capable() level of capabilities */
> > > +     if (token) {
> >
> > I think we want a LSM hook here before the token is used in the
> > capability check.  The LSM will see the capability check, but it will
> > not be able to distinguish it from the process which created the
> > delegation token.  This is arguably the purpose of the delegation, but
> > with the LSM we want to be able to control who can use the delegated
> > privilege.  How about something like this:
> >
> >   if (security_bpf_token_capable(token, cap))
> >      return false;
>
> sounds good, I'll add this hook
>
> btw, I'm thinking of guarding the BPF_TOKEN_CREATE command behind the
> ns_capable(CAP_BPF) check, WDYT? This seems appropriate. You can get
> BPF token only if you have CAP_BPF **within the userns**, so any
> process not granted CAP_BPF within namespace ("container") is
> guaranteed to not be able to do anything with BPF token.

I don't recall seeing any capability checks guarding BPF_TOKEN_CREATE
in this revision so I don't think you're weakening the restrictions
any, and the logic above seems reasonable: if you don't have CAP_BPF
you shouldn't be creating a token.

> > > +static struct bpf_token *bpf_token_alloc(void)
> > > +{
> > > +     struct bpf_token *token;
> > > +
> > > +     token = kvzalloc(sizeof(*token), GFP_USER);
> > > +     if (!token)
> > > +             return NULL;
> > > +
> > > +     atomic64_set(&token->refcnt, 1);
> >
> > We should have a LSM hook here to allocate the LSM state associated
> > with the token.
> >
> >   if (security_bpf_token_alloc(token)) {
> >     kvfree(token);
> >     return NULL;
> >   }
> >
> > > +     return token;
> > > +}
> >
> > ...
> >
>
> Would having userns and allowed_* masks filled out by that time inside
> the token be useful (seems so if we treat bpf_token_alloc as generic
> LSM hook). If yes, I'll add security_bpf_token_alloc() after all that
> is filled out, right before we try to get unused fd. WDYT?

The security_bpf_token_alloc() hook isn't intended to do any access
control, it's just there so that the LSMs which need to allocate state
for the token object can do so at the same time the token is
allocated.  It has been my experience that allocating and releasing
the LSM state at the same time as the primary object's state is much
less fragile than disconnecting the two lifetimes and allocating the
LSM state later.

> > > +int bpf_token_create(union bpf_attr *attr)
> > > +{
> > > +     struct bpf_mount_opts *mnt_opts;
> > > +     struct bpf_token *token = NULL;
> > > +     struct inode *inode;
> > > +     struct file *file;
> > > +     struct path path;
> > > +     umode_t mode;
> > > +     int err, fd;
> > > +
> > > +     err = user_path_at(attr->token_create.bpffs_path_fd,
> > > +                        u64_to_user_ptr(attr->token_create.bpffs_pathname),
> > > +                        LOOKUP_FOLLOW | LOOKUP_EMPTY, &path);
> > > +     if (err)
> > > +             return err;
> > > +
> > > +     if (path.mnt->mnt_root != path.dentry) {
> > > +             err = -EINVAL;
> > > +             goto out_path;
> > > +     }
> > > +     err = path_permission(&path, MAY_ACCESS);
> > > +     if (err)
> > > +             goto out_path;
> > > +
> > > +     mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
> > > +     inode = bpf_get_inode(path.mnt->mnt_sb, NULL, mode);
> > > +     if (IS_ERR(inode)) {
> > > +             err = PTR_ERR(inode);
> > > +             goto out_path;
> > > +     }
> > > +
> > > +     inode->i_op = &bpf_token_iops;
> > > +     inode->i_fop = &bpf_token_fops;
> > > +     clear_nlink(inode); /* make sure it is unlinked */
> > > +
> > > +     file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > > +     if (IS_ERR(file)) {
> > > +             iput(inode);
> > > +             err = PTR_ERR(file);
> > > +             goto out_file;
> > > +     }
> > > +
> > > +     token = bpf_token_alloc();
> > > +     if (!token) {
> > > +             err = -ENOMEM;
> > > +             goto out_file;
> > > +     }
> > > +
> > > +     /* remember bpffs owning userns for future ns_capable() checks */
> > > +     token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);
> > > +
> > > +     mnt_opts = path.dentry->d_sb->s_fs_info;
> > > +     token->allowed_cmds = mnt_opts->delegate_cmds;
> >
> > I think we would want a LSM hook here, both to control the creation
> > of the token and mark it with the security attributes of the creating
> > process.  How about something like this:
> >
> >   err = security_bpf_token_create(token);
> >   if (err)
> >     goto out_token;
>
> hmm... so you'd like both security_bpf_token_alloc() and
> security_bpf_token_create()? They seem almost identical, do we need
> two? Or is it that the security_bpf_token_alloc() is supposed to be
> only used to create those `void *security` context pieces, while
> security_bpf_token_create() is actually going to be used for
> enforcement?

I tried to explain a bit in my comment above, but the alloc hook
basically just does the LSM state allocation whereas the token_create
hook does the access control and setting of the LSM state associated
with the token.

If you want to get rid of the bpf_token_alloc() function and fold it
into the bpf_token_create() function then we can go down to one LSM
hook that covers state allocation, initialization, and access control.
However, if it remains possible to allocate a token object outside of
bpf_token_create() I think it is a good idea to keep the dedicated LSM
allocation hooks.

You can apply the same logic to the LSM token state destructor hook.

> For my own education, is there some explicit flag or some
> other sort of mark between LSM hooks for setting up security vs
> enforcement? Or is it mostly based on convention and implicitly
> following the split?

Generally convention based around trying to match the LSM state
lifetime with the lifetime of the associated object; as I mentioned
earlier, we've had problems in the past when the two differ.  If you
look at all of the LSM hooks you'll see a number of
"security_XXXX_alloc()" hooks for things like superblocks, inodes,
files, task structs, creds, and the like; we're just doing the same
things here with BPF tokens.  If you're still not convinced, it may be
worth noting that we currently have security_bpf_map_alloc() and
security_bpf_prog_alloc() hooks.

> > > +     fd = get_unused_fd_flags(O_CLOEXEC);
> > > +     if (fd < 0) {
> > > +             err = fd;
> > > +             goto out_token;
> > > +     }
> > > +
> > > +     file->private_data = token;
> > > +     fd_install(fd, file);
> > > +
> > > +     path_put(&path);
> > > +     return fd;
> > > +
> > > +out_token:
> > > +     bpf_token_free(token);
> > > +out_file:
> > > +     fput(file);
> > > +out_path:
> > > +     path_put(&path);
> > > +     return err;
> > > +}
> >
> > ...
> >
> > > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
> > > +{
> > > +     if (!token)
> > > +             return false;
> > > +
> > > +     return token->allowed_cmds & (1ULL << cmd);
> >
> > Similar to bpf_token_capable(), I believe we want a LSM hook here to
> > control who is allowed to use the delegated privilege.
> >
> >   bool bpf_token_allow_cmd(...)
> >   {
> >     if (token && (token->allowed_cmds & (1ULL << cmd))
> >       return security_bpf_token_cmd(token, cmd);
>
> ok, so I guess I'll have to add all four variants:
> security_bpf_token_{cmd,map_type,prog_type,attach_type}, right?

Not necessarily.  Currently only SELinux provides a set of LSM BPF
access controls, and it only concerns itself with BPF maps and
programs.  From a map perspective it comes down to controlling which
applications can create a map, or use a map created elsewhere (we
label BPF map objects just as we would any other kernel object).  From
a program perspective, it is about loading programs and executing
them.  While not BPF specific, SELinux also provides controls that
restrict the ability to exercise capabilities.

Keeping the above access controls in mind, I'm hopeful you can better
understand the reasoning behind the hooks I'm proposing.  The
security_bpf_token_cmd() hook is there so that we can control a
token's ability to authorize BPF_MAP_CREATE and BPF_PROG_LOAD.  The
security_bpf_token_capable() hook is there so that we can control a
token's ability to authorize the BPF-related capabilities.  While I
guess it's possible someone might develop a security model for BPF
that would require the other LSM hooks you've mentioned above, but I
see no need for that now, and to be honest I don't see any need on the
visible horizon either.

Does that make sense?

--
paul-moore.com