On Tue, Sep 26, 2023 at 9:21 AM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Tue, Sep 19, 2023 at 02:47:50PM -0700, Andrii Nakryiko wrote: > > Add new kind of BPF kernel object, BPF token. BPF token is meant to > > allow delegating privileged BPF functionality, like loading a BPF > > program or creating a BPF map, from privileged process to a *trusted* > > unprivileged process, all while have a good amount of control over which > > privileged operations could be performed using provided BPF token. > > > > This is achieved through mounting BPF FS instance with extra delegation > > mount options, which determine what operations are delegatable, and also > > constraining it to the owning user namespace (as mentioned in the > > previous patch). > > > > BPF token itself is just a derivative from BPF FS and can be created > > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts > > a path specification (using the usual fd + string path combo) to a BPF > > FS mount. Currently, BPF token "inherits" delegated command, map types, > > prog type, and attach type bit sets from BPF FS as is. In the future, > > having an BPF token as a separate object with its own FD, we can allow > > to further restrict BPF token's allowable set of things either at the creation > > time or after the fact, allowing the process to guard itself further > > from, e.g., unintentionally trying to load undesired kind of BPF > > programs. But for now we keep things simple and just copy bit sets as is. > > > > When BPF token is created from BPF FS mount, we take reference to the > > BPF super block's owning user namespace, and then use that namespace for > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} > > capabilities that are normally only checked against init userns (using > > capable()), but now we check them using ns_capable() instead (if BPF > > token is provided). See bpf_token_capable() for details. > > > > Such setup means that BPF token in itself is not sufficient to grant BPF > > functionality. User namespaced process has to *also* have necessary > > combination of capabilities inside that user namespace. So while > > previously CAP_BPF was useless when granted within user namespace, now > > it gains a meaning and allows container managers and sys admins to have > > a flexible control over which processes can and need to use BPF > > functionality within the user namespace (i.e., container in practice). > > And BPF FS delegation mount options and derived BPF tokens serve as > > a per-container "flag" to grant overall ability to use bpf() (plus further > > restrict on which parts of bpf() syscalls are treated as namespaced). > > > > The alternative to creating BPF token object was: > > a) not having any extra object and just pasing BPF FS path to each > > relevant bpf() command. This seems suboptimal as it's racy (mount > > under the same path might change in between checking it and using it > > for bpf() command). And also less flexible if we'd like to further > > restrict ourselves compared to all the delegated functionality > > allowed on BPF FS. > > b) use non-bpf() interface, e.g., ioctl(), but otherwise also create > > a dedicated FD that would represent a token-like functionality. This > > doesn't seem superior to having a proper bpf() command, so > > BPF_TOKEN_CREATE was chosen. > > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > > --- > > include/linux/bpf.h | 42 ++++++++ > > include/uapi/linux/bpf.h | 39 +++++++ > > kernel/bpf/Makefile | 2 +- > > kernel/bpf/inode.c | 4 +- > > kernel/bpf/syscall.c | 17 +++ > > kernel/bpf/token.c | 189 +++++++++++++++++++++++++++++++++ > > tools/include/uapi/linux/bpf.h | 39 +++++++ > > 7 files changed, 330 insertions(+), 2 deletions(-) > > create mode 100644 kernel/bpf/token.c > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > index 026923a60cad..ae13538f5465 100644 > > --- a/include/linux/bpf.h > > +++ b/include/linux/bpf.h > > @@ -51,6 +51,8 @@ struct module; > > struct bpf_func_state; > > struct ftrace_ops; > > struct cgroup; > > +struct bpf_token; > > +struct user_namespace; > > > > extern struct idr btf_idr; > > extern spinlock_t btf_idr_lock; > > @@ -1572,6 +1574,13 @@ struct bpf_mount_opts { > > u64 delegate_attachs; > > }; > > > > +struct bpf_token { > > + struct work_struct work; > > + atomic64_t refcnt; > > + struct user_namespace *userns; > > + u64 allowed_cmds; > > +}; > > + > > struct bpf_struct_ops_value; > > struct btf_member; > > > > @@ -2162,6 +2171,8 @@ static inline void bpf_map_dec_elem_count(struct bpf_map *map) > > > > extern int sysctl_unprivileged_bpf_disabled; > > > > +bool bpf_token_capable(const struct bpf_token *token, int cap); > > + > > static inline bool bpf_allow_ptr_leaks(void) > > { > > return perfmon_capable(); > > @@ -2196,6 +2207,14 @@ int bpf_link_new_fd(struct bpf_link *link); > > struct bpf_link *bpf_link_get_from_fd(u32 ufd); > > struct bpf_link *bpf_link_get_curr_or_next(u32 *id); > > > > +void bpf_token_inc(struct bpf_token *token); > > +void bpf_token_put(struct bpf_token *token); > > +int bpf_token_create(union bpf_attr *attr); > > +int bpf_token_new_fd(struct bpf_token *token); > > +struct bpf_token *bpf_token_get_from_fd(u32 ufd); > > + > > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd); > > + > > int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname); > > int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags); > > > > @@ -2557,6 +2576,29 @@ static inline int bpf_obj_get_user(const char __user *pathname, int flags) > > return -EOPNOTSUPP; > > } > > > > +static inline bool bpf_token_capable(const struct bpf_token *token, int cap) > > +{ > > + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN)); > > +} > > + > > +static inline void bpf_token_inc(struct bpf_token *token) > > +{ > > +} > > + > > +static inline void bpf_token_put(struct bpf_token *token) > > +{ > > +} > > + > > +static inline int bpf_token_new_fd(struct bpf_token *token) > > +{ > > + return -EOPNOTSUPP; > > +} > > + > > +static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd) > > +{ > > + return ERR_PTR(-EOPNOTSUPP); > > +} > > + > > static inline void __dev_flush(void) > > { > > } > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 73b155e52204..36e98c6f8944 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -847,6 +847,37 @@ union bpf_iter_link_info { > > * Returns zero on success. On error, -1 is returned and *errno* > > * is set appropriately. > > * > > + * BPF_TOKEN_CREATE > > + * Description > > + * Create BPF token with embedded information about what > > + * BPF-related functionality it allows: > > + * - a set of allowed bpf() syscall commands; > > + * - a set of allowed BPF map types to be created with > > + * BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed; > > + * - a set of allowed BPF program types and BPF program attach > > + * types to be loaded with BPF_PROG_LOAD command, if > > + * BPF_PROG_LOAD itself is allowed. > > + * > > + * BPF token is created (derived) from an instance of BPF FS, > > + * assuming it has necessary delegation mount options specified. > > + * BPF FS mount is specified with openat()-style path FD + string. > > + * This BPF token can be passed as an extra parameter to various > > + * bpf() syscall commands to grant BPF subsystem functionality to > > + * unprivileged processes. > > + * > > + * When created, BPF token is "associated" with the owning > > + * user namespace of BPF FS instance (super block) that it was > > + * derived from, and subsequent BPF operations performed with > > + * BPF token would be performing capabilities checks (i.e., > > + * CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within > > + * that user namespace. Without BPF token, such capabilities > > + * have to be granted in init user namespace, making bpf() > > + * syscall incompatible with user namespace, for the most part. > > + * > > + * Return > > + * A new file descriptor (a nonnegative integer), or -1 if an > > + * error occurred (in which case, *errno* is set appropriately). > > + * > > * NOTES > > * eBPF objects (maps and programs) can be shared between processes. > > * > > @@ -901,6 +932,8 @@ enum bpf_cmd { > > BPF_ITER_CREATE, > > BPF_LINK_DETACH, > > BPF_PROG_BIND_MAP, > > + BPF_TOKEN_CREATE, > > + __MAX_BPF_CMD, > > }; > > > > enum bpf_map_type { > > @@ -1694,6 +1727,12 @@ union bpf_attr { > > __u32 flags; /* extra flags */ > > } prog_bind_map; > > > > + struct { /* struct used by BPF_TOKEN_CREATE command */ > > + __u32 flags; > > + __u32 bpffs_path_fd; > > + __u64 bpffs_pathname; > > + } token_create; > > + > > } __attribute__((aligned(8))); > > > > /* The description below is an attempt at providing documentation to eBPF > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > > index f526b7573e97..4ce95acfcaa7 100644 > > --- a/kernel/bpf/Makefile > > +++ b/kernel/bpf/Makefile > > @@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse > > endif > > CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy) > > > > -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o > > +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o > > obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o > > obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o > > obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o > > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c > > index 8f66b57d3546..82f11fbffd3e 100644 > > --- a/kernel/bpf/inode.c > > +++ b/kernel/bpf/inode.c > > @@ -603,11 +603,13 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root) > > { > > struct bpf_mount_opts *opts = root->d_sb->s_fs_info; > > umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX; > > + u64 mask; > > > > if (mode != S_IRWXUGO) > > seq_printf(m, ",mode=%o", mode); > > > > - if (opts->delegate_cmds == ~0ULL) > > + mask = (1ULL << __MAX_BPF_CMD) - 1; > > + if ((opts->delegate_cmds & mask) == mask) > > seq_printf(m, ",delegate_cmds=any"); > > else if (opts->delegate_cmds) > > seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds); > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > > index f024caee0bba..93338faa43d5 100644 > > --- a/kernel/bpf/syscall.c > > +++ b/kernel/bpf/syscall.c > > @@ -5302,6 +5302,20 @@ static int bpf_prog_bind_map(union bpf_attr *attr) > > return ret; > > } > > > > +#define BPF_TOKEN_CREATE_LAST_FIELD token_create.bpffs_pathname > > + > > +static int token_create(union bpf_attr *attr) > > +{ > > + if (CHECK_ATTR(BPF_TOKEN_CREATE)) > > + return -EINVAL; > > + > > + /* no flags are supported yet */ > > + if (attr->token_create.flags) > > + return -EINVAL; > > + > > + return bpf_token_create(attr); > > +} > > + > > static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size) > > { > > union bpf_attr attr; > > @@ -5435,6 +5449,9 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size) > > case BPF_PROG_BIND_MAP: > > err = bpf_prog_bind_map(&attr); > > break; > > + case BPF_TOKEN_CREATE: > > + err = token_create(&attr); > > + break; > > default: > > err = -EINVAL; > > break; > > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c > > new file mode 100644 > > index 000000000000..f6ea3eddbee6 > > --- /dev/null > > +++ b/kernel/bpf/token.c > > @@ -0,0 +1,189 @@ > > +#include <linux/bpf.h> > > +#include <linux/vmalloc.h> > > +#include <linux/anon_inodes.h> > > +#include <linux/fdtable.h> > > +#include <linux/file.h> > > +#include <linux/fs.h> > > +#include <linux/kernel.h> > > +#include <linux/idr.h> > > +#include <linux/namei.h> > > +#include <linux/user_namespace.h> > > + > > +bool bpf_token_capable(const struct bpf_token *token, int cap) > > +{ > > + /* BPF token allows ns_capable() level of capabilities */ > > + if (token) { > > + if (ns_capable(token->userns, cap)) > > + return true; > > + if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN)) > > + return true; > > + } > > + /* otherwise fallback to capable() checks */ > > + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN)); > > +} > > + > > +void bpf_token_inc(struct bpf_token *token) > > +{ > > + atomic64_inc(&token->refcnt); > > +} > > + > > +static void bpf_token_free(struct bpf_token *token) > > +{ > > + put_user_ns(token->userns); > > + kvfree(token); > > +} > > + > > +static void bpf_token_put_deferred(struct work_struct *work) > > +{ > > + struct bpf_token *token = container_of(work, struct bpf_token, work); > > + > > + bpf_token_free(token); > > +} > > + > > +void bpf_token_put(struct bpf_token *token) > > +{ > > + if (!token) > > + return; > > + > > + if (!atomic64_dec_and_test(&token->refcnt)) > > + return; > > + > > + INIT_WORK(&token->work, bpf_token_put_deferred); > > + schedule_work(&token->work); > > +} > > + > > +static int bpf_token_release(struct inode *inode, struct file *filp) > > +{ > > + struct bpf_token *token = filp->private_data; > > + > > + bpf_token_put(token); > > + return 0; > > +} > > + > > +static ssize_t bpf_dummy_read(struct file *filp, char __user *buf, size_t siz, > > + loff_t *ppos) > > +{ > > + /* We need this handler such that alloc_file() enables > > + * f_mode with FMODE_CAN_READ. > > + */ > > + return -EINVAL; > > +} > > + > > +static ssize_t bpf_dummy_write(struct file *filp, const char __user *buf, > > + size_t siz, loff_t *ppos) > > +{ > > + /* We need this handler such that alloc_file() enables > > + * f_mode with FMODE_CAN_WRITE. > > + */ > > + return -EINVAL; > > +} > > + > > +static void bpf_token_show_fdinfo(struct seq_file *m, struct file *filp) > > +{ > > + struct bpf_token *token = filp->private_data; > > + u64 mask; > > + > > + mask = (1ULL << __MAX_BPF_CMD) - 1; > > + if ((token->allowed_cmds & mask) == mask) > > + seq_printf(m, "allowed_cmds:\tany\n"); > > + else > > + seq_printf(m, "allowed_cmds:\t0x%llx\n", token->allowed_cmds); > > +} > > + > > +static const struct file_operations bpf_token_fops = { > > + .release = bpf_token_release, > > + .read = bpf_dummy_read, > > + .write = bpf_dummy_write, > > + .show_fdinfo = bpf_token_show_fdinfo, > > +}; > > + > > +static struct bpf_token *bpf_token_alloc(void) > > +{ > > + struct bpf_token *token; > > + > > + token = kvzalloc(sizeof(*token), GFP_USER); > > + if (!token) > > + return NULL; > > + > > + atomic64_set(&token->refcnt, 1); > > + > > + return token; > > +} > > + > > +int bpf_token_create(union bpf_attr *attr) > > +{ > > + struct path path; > > + struct bpf_mount_opts *mnt_opts; > > + struct bpf_token *token; > > + int ret; > > + > > + ret = user_path_at(attr->token_create.bpffs_path_fd, > > + u64_to_user_ptr(attr->token_create.bpffs_pathname), > > + LOOKUP_FOLLOW | LOOKUP_EMPTY, &path); > > + if (ret) > > + return ret; > > + > > + if (path.mnt->mnt_root != path.dentry) { > > + ret = -EINVAL; > > + goto out; > > + } > > + ret = path_permission(&path, MAY_ACCESS); > > + if (ret) > > + goto out; > > + > > + token = bpf_token_alloc(); > > + if (!token) { > > + ret = -ENOMEM; > > + goto out; > > + } > > + > > + /* remember bpffs owning userns for future ns_capable() checks */ > > + token->userns = get_user_ns(path.dentry->d_sb->s_user_ns); > > + > > + mnt_opts = path.dentry->d_sb->s_fs_info; > > + token->allowed_cmds = mnt_opts->delegate_cmds; > > + > > + ret = bpf_token_new_fd(token); > > + if (ret < 0) > > + bpf_token_free(token); > > +out: > > + path_put(&path); > > + return ret; > > +} > > + > > +#define BPF_TOKEN_INODE_NAME "bpf-token" > > + > > +/* Alloc anon_inode and FD for prepared token. > > + * Returns fd >= 0 on success; negative error, otherwise. > > + */ > > +int bpf_token_new_fd(struct bpf_token *token) > > +{ > > + return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC); > > It's unnecessary to use the anonymous inode infrastructure for bpf > tokens. It adds even more moving parts and makes reasoning about it even > harder. Just keep it all in bpffs. IIRC, something like the following > (broken, non-compiling draft) should work: > > /* bpf_token_file - get an unlinked file living in bpffs */ > struct file *bpf_token_file(...) > { > inode = bpf_get_inode(bpffs_mnt->mnt_sb, dir, mode); > inode->i_op = &bpf_token_iop; > inode->i_fop = &bpf_token_fops; > > // some other stuff you might want or need > > res = alloc_file_pseudo(inode, bpffs_mnt, "bpf-token", O_RDWR, &bpf_token_fops); > } > > Now set your private data that you might need, reserve an fd, install > the file into the fdtable and return the fd. You should have an unlinked > bpffs file that serves as your bpf token. Just to make sure I understand. You are saying that instead of having `struct bpf_token *` and passing that into internal APIs (bpf_token_capable() and bpf_token_allow_xxx()), I should just pass around `struct super_block *` representing BPF FS instance? Or `struct bpf_mount_opts *` maybe? Or 'struct vfsmount *'? (Any preferences here?). Is that right? The point is not to have a struct bpf_token that keeps its own refcount, doesn't maintain its own allowed_xxx masks, and doesn't keep a refcnt on userns. Should I worry about refcounting of the super_block? It was a nice property that I could store bpf_token inside the program for some future checks that could happen during attach time after the BPF program is loaded and verified. How do I achieve the same if I need a super_block around? Is there some get/put-like APIs for super_block (or vfsmount?) that I can use for that? I'm sorry if it's stupid questions, just trying to cover all the ground before I reimplement portions of this patch set again. Thanks for understanding!