Re: [PATCH v2 bpf-next 1/3] bpf: support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands

Christian Brauner <brauner@xxxxxxxxxx> · Fri, 19 May 2023 11:49:45 +0200

On Thu, May 18, 2023 at 02:54:42PM -0700, Andrii Nakryiko wrote:
> Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
> forces users to specify pinning location as a string-based absolute or
> relative (to current working directory) path. This has various
> implications related to security (e.g., symlink-based attacks), forces
> BPF FS to be exposed in the file system, which can cause races with
> other applications.
> 
> One of the feedbacks we got from folks working with containers heavily
> was that inability to use purely FD-based location specification was an
> unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
> commands. This patch closes this oversight, adding path_fd field to
> BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
> *at() syscalls for dirfd + pathname combinations.
> 
> This now allows interesting possibilities like working with detached BPF
> FS mount (e.g., to perform multiple pinnings without running a risk of
> someone interfering with them), and generally making pinning/getting
> more secure and not prone to any races and/or security attacks.
> 
> This is demonstrated by a selftest added in subsequent patch that takes
> advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
> creating detached BPF FS mount, pinning, and then getting BPF map out of
> it, all while never exposing this private instance of BPF FS to outside
> worlds.
> 
> Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> ---
>  include/linux/bpf.h            |  4 ++--
>  include/uapi/linux/bpf.h       | 10 ++++++++++
>  kernel/bpf/inode.c             | 16 ++++++++--------
>  kernel/bpf/syscall.c           | 25 ++++++++++++++++++++-----
>  tools/include/uapi/linux/bpf.h | 10 ++++++++++
>  5 files changed, 50 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 36e4b2d8cca2..f58895830ada 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
>  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>  struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
>  
> -int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> -int bpf_obj_get_user(const char __user *pathname, int flags);
> +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
>  
>  #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
>  #define DEFINE_BPF_ITER_FUNC(target, args...)			\
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1bb11a6ee667..3731284671e4 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1272,6 +1272,9 @@ enum {
>  
>  /* Create a map that will be registered/unregesitered by the backed bpf_link */
>  	BPF_F_LINK		= (1U << 13),
> +
> +/* Get path from provided FD in BPF_OBJ_PIN/BPF_OBJ_GET commands */
> +	BPF_F_PATH_FD		= (1U << 14),
>  };
>  
>  /* Flags for BPF_PROG_QUERY. */
> @@ -1420,6 +1423,13 @@ union bpf_attr {
>  		__aligned_u64	pathname;
>  		__u32		bpf_fd;
>  		__u32		file_flags;
> +		/* Same as dirfd in openat() syscall; see openat(2)
> +		 * manpage for details of path FD and pathname semantics;
> +		 * path_fd should accompanied by BPF_F_PATH_FD flag set in
> +		 * file_flags field, otherwise it should be set to zero;
> +		 * if BPF_F_PATH_FD flag is not set, AT_FDCWD is assumed.
> +		 */
> +		__u32		path_fd;
>  	};

Thanks for changing that.

This is still odd though because you prevent users from specifying
AT_FDCWD explicitly. They should be allowed to do that plus file
descriptors are signed integers so please s/__u32/__s32/. AT_FDCWD
should be passable anywhere where we have at* semantics. Plus, if in the
vfs we ever add
#define AT_ROOT -200
or something you can't use without coming up with your own custom flags.
If you just follow what everyone else does and use __s32 then you're
good.

File descriptors really need to be signed. There's no way around that.
See io_uring as a good example

io_uring_sqe {
          __u8    opcode;         /* type of operation for this sqe */
          __u8    flags;          /* IOSQE_ flags */
          __u16   ioprio;         /* ioprio for the request */
          __s32   fd;             /* file descriptor to do IO on */
}

where the __s32 fd is used in all fd based requests including
io_openat*() (See io_uring/openclose.c) which are effectively the
semantics you want to emulate here.