On Thu, May 18, 2023 at 02:54:42PM -0700, Andrii Nakryiko wrote: > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall > forces users to specify pinning location as a string-based absolute or > relative (to current working directory) path. This has various > implications related to security (e.g., symlink-based attacks), forces > BPF FS to be exposed in the file system, which can cause races with > other applications. > > One of the feedbacks we got from folks working with containers heavily > was that inability to use purely FD-based location specification was an > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET > commands. This patch closes this oversight, adding path_fd field to > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by > *at() syscalls for dirfd + pathname combinations. > > This now allows interesting possibilities like working with detached BPF > FS mount (e.g., to perform multiple pinnings without running a risk of > someone interfering with them), and generally making pinning/getting > more secure and not prone to any races and/or security attacks. > > This is demonstrated by a selftest added in subsequent patch that takes > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate > creating detached BPF FS mount, pinning, and then getting BPF map out of > it, all while never exposing this private instance of BPF FS to outside > worlds. > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > --- > include/linux/bpf.h | 4 ++-- > include/uapi/linux/bpf.h | 10 ++++++++++ > kernel/bpf/inode.c | 16 ++++++++-------- > kernel/bpf/syscall.c | 25 ++++++++++++++++++++----- > tools/include/uapi/linux/bpf.h | 10 ++++++++++ > 5 files changed, 50 insertions(+), 15 deletions(-) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 36e4b2d8cca2..f58895830ada 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd); > struct bpf_link *bpf_link_get_from_fd(u32 ufd); > struct bpf_link *bpf_link_get_curr_or_next(u32 *id); > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname); > -int bpf_obj_get_user(const char __user *pathname, int flags); > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname); > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags); > > #define BPF_ITER_FUNC_PREFIX "bpf_iter_" > #define DEFINE_BPF_ITER_FUNC(target, args...) \ > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 1bb11a6ee667..3731284671e4 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -1272,6 +1272,9 @@ enum { > > /* Create a map that will be registered/unregesitered by the backed bpf_link */ > BPF_F_LINK = (1U << 13), > + > +/* Get path from provided FD in BPF_OBJ_PIN/BPF_OBJ_GET commands */ > + BPF_F_PATH_FD = (1U << 14), > }; > > /* Flags for BPF_PROG_QUERY. */ > @@ -1420,6 +1423,13 @@ union bpf_attr { > __aligned_u64 pathname; > __u32 bpf_fd; > __u32 file_flags; > + /* Same as dirfd in openat() syscall; see openat(2) > + * manpage for details of path FD and pathname semantics; > + * path_fd should accompanied by BPF_F_PATH_FD flag set in > + * file_flags field, otherwise it should be set to zero; > + * if BPF_F_PATH_FD flag is not set, AT_FDCWD is assumed. > + */ > + __u32 path_fd; > }; Thanks for changing that. This is still odd though because you prevent users from specifying AT_FDCWD explicitly. They should be allowed to do that plus file descriptors are signed integers so please s/__u32/__s32/. AT_FDCWD should be passable anywhere where we have at* semantics. Plus, if in the vfs we ever add #define AT_ROOT -200 or something you can't use without coming up with your own custom flags. If you just follow what everyone else does and use __s32 then you're good. File descriptors really need to be signed. There's no way around that. See io_uring as a good example io_uring_sqe { __u8 opcode; /* type of operation for this sqe */ __u8 flags; /* IOSQE_ flags */ __u16 ioprio; /* ioprio for the request */ __s32 fd; /* file descriptor to do IO on */ } where the __s32 fd is used in all fd based requests including io_openat*() (See io_uring/openclose.c) which are effectively the semantics you want to emulate here.