On Wed, Nov 8, 2023 at 6:28 AM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Fri, Nov 03, 2023 at 12:05:09PM -0700, Andrii Nakryiko wrote: > > Add new kind of BPF kernel object, BPF token. BPF token is meant to > > allow delegating privileged BPF functionality, like loading a BPF > > program or creating a BPF map, from privileged process to a *trusted* > > unprivileged process, all while have a good amount of control over which > > privileged operations could be performed using provided BPF token. > > > > This is achieved through mounting BPF FS instance with extra delegation > > mount options, which determine what operations are delegatable, and also > > constraining it to the owning user namespace (as mentioned in the > > previous patch). > > > > BPF token itself is just a derivative from BPF FS and can be created > > through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts > > a path specification (using the usual fd + string path combo) to a BPF > > FS mount. Currently, BPF token "inherits" delegated command, map types, > > prog type, and attach type bit sets from BPF FS as is. In the future, > > having an BPF token as a separate object with its own FD, we can allow > > to further restrict BPF token's allowable set of things either at the creation > > time or after the fact, allowing the process to guard itself further > > from, e.g., unintentionally trying to load undesired kind of BPF > > programs. But for now we keep things simple and just copy bit sets as is. > > > > When BPF token is created from BPF FS mount, we take reference to the > > BPF super block's owning user namespace, and then use that namespace for > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} > > capabilities that are normally only checked against init userns (using > > capable()), but now we check them using ns_capable() instead (if BPF > > token is provided). See bpf_token_capable() for details. > > > > Such setup means that BPF token in itself is not sufficient to grant BPF > > functionality. User namespaced process has to *also* have necessary > > combination of capabilities inside that user namespace. So while > > previously CAP_BPF was useless when granted within user namespace, now > > it gains a meaning and allows container managers and sys admins to have > > a flexible control over which processes can and need to use BPF > > functionality within the user namespace (i.e., container in practice). > > And BPF FS delegation mount options and derived BPF tokens serve as > > a per-container "flag" to grant overall ability to use bpf() (plus further > > restrict on which parts of bpf() syscalls are treated as namespaced). > > > > Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF) > > within the BPF FS owning user namespace, rounding up the ns_capable() > > story of BPF token. > > > > The alternative to creating BPF token object was: > > a) not having any extra object and just pasing BPF FS path to each > > relevant bpf() command. This seems suboptimal as it's racy (mount > > under the same path might change in between checking it and using it > > for bpf() command). And also less flexible if we'd like to further > > I don't understand "mount under the same path might change in between > checking it and using it for bpf() command". > > Just require userspace to open() the bpffs instance and pass that fd to > bpf() just as you're doing right now. If that is racy then the current > implementation is even more so because it is passing: > > bpffs_path_fd > bpffs_pathname > > and then performs a lookup. More on that below. Yes, this is a result of my initial confusion with how O_PATH-based open() works. You are right that it's not racy, I'll update the message. > > I want to point out that most of this code here is unnecessary if you > use the bpffs fd itself as a token. But that's your decision. I'm just > saying that I'm not sure the critique that it's racy is valid. Ack. > > > restrict ourselves compared to all the delegated functionality > > allowed on BPF FS. > > b) use non-bpf() interface, e.g., ioctl(), but otherwise also create > > a dedicated FD that would represent a token-like functionality. This > > doesn't seem superior to having a proper bpf() command, so > > BPF_TOKEN_CREATE was chosen. > > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > > --- > > include/linux/bpf.h | 41 +++++++ > > include/uapi/linux/bpf.h | 39 +++++++ > > kernel/bpf/Makefile | 2 +- > > kernel/bpf/inode.c | 17 ++- > > kernel/bpf/syscall.c | 17 +++ > > kernel/bpf/token.c | 197 +++++++++++++++++++++++++++++++++ > > tools/include/uapi/linux/bpf.h | 39 +++++++ > > 7 files changed, 342 insertions(+), 10 deletions(-) > > create mode 100644 kernel/bpf/token.c > > [...] > > + > > +#define BPF_TOKEN_INODE_NAME "bpf-token" > > + > > +static const struct inode_operations bpf_token_iops = { }; > > + > > +static const struct file_operations bpf_token_fops = { > > + .release = bpf_token_release, > > + .show_fdinfo = bpf_token_show_fdinfo, > > +}; > > + > > +int bpf_token_create(union bpf_attr *attr) > > +{ > > + struct bpf_mount_opts *mnt_opts; > > + struct bpf_token *token = NULL; > > + struct user_namespace *userns; > > + struct inode *inode; > > + struct file *file; > > + struct path path; > > + umode_t mode; > > + int err, fd; > > + > > + err = user_path_at(attr->token_create.bpffs_path_fd, > > + u64_to_user_ptr(attr->token_create.bpffs_pathname), > > + LOOKUP_FOLLOW | LOOKUP_EMPTY, &path); > > Do you really need bpffs_path_fd and bpffs_pathname? > This seems unnecessar as you're forcing a lookup that's best done in > userspace through regular open() apis. So I would just make this: > > struct { /* struct used by BPF_TOKEN_CREATE command */ > __u32 flags; > __u32 bpffs_path_fd; > } token_create; > > In bpf_token_create() you can then just do: > > struct fd f; > struct path path; > > f = fdget(attr->token_create.bpffs_path_fd); > if (!f.file) > return -EBADF; > > *path = f.file->f_path; > path_get(path); > fdput(f); > Yes, you are right. I'll simplify this part, thanks. > > + if (err) > > + return err; > > + > > + if (path.mnt->mnt_root != path.dentry) { > > + err = -EINVAL; > > + goto out_path; > > + } > > + if (path.mnt->mnt_sb->s_op != &bpf_super_ops) { > > + err = -EINVAL; > > + goto out_path; > > + } [...]