fd == 0 means AT_FDCWD BPF_OBJ_GET commands

Christian Brauner <brauner@xxxxxxxxxx> · Wed, 17 May 2023 11:11:24 +0200

On Tue, May 16, 2023 at 11:02:42AM -0700, Andrii Nakryiko wrote:
> On Tue, May 16, 2023 at 2:07 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >
> > On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote:
> > > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
> > > forces users to specify pinning location as a string-based absolute or
> > > relative (to current working directory) path. This has various
> > > implications related to security (e.g., symlink-based attacks), forces
> > > BPF FS to be exposed in the file system, which can cause races with
> > > other applications.
> > >
> > > One of the feedbacks we got from folks working with containers heavily
> > > was that inability to use purely FD-based location specification was an
> > > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
> > > commands. This patch closes this oversight, adding path_fd field to
> >
> > Cool!
> >
> > > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
> > > *at() syscalls for dirfd + pathname combinations.
> > >
> > > This now allows interesting possibilities like working with detached BPF
> > > FS mount (e.g., to perform multiple pinnings without running a risk of
> > > someone interfering with them), and generally making pinning/getting
> > > more secure and not prone to any races and/or security attacks.
> > >
> > > This is demonstrated by a selftest added in subsequent patch that takes
> > > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
> > > creating detached BPF FS mount, pinning, and then getting BPF map out of
> > > it, all while never exposing this private instance of BPF FS to outside
> > > worlds.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > ---
> > >  include/linux/bpf.h            |  4 ++--
> > >  include/uapi/linux/bpf.h       |  5 +++++
> > >  kernel/bpf/inode.c             | 16 ++++++++--------
> > >  kernel/bpf/syscall.c           |  8 +++++---
> > >  tools/include/uapi/linux/bpf.h |  5 +++++
> > >  5 files changed, 25 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 36e4b2d8cca2..f58895830ada 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
> > >  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> > >  struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
> > >
> > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> > > -int bpf_obj_get_user(const char __user *pathname, int flags);
> > > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> > > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
> > >
> > >  #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
> > >  #define DEFINE_BPF_ITER_FUNC(target, args...)                        \
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 1bb11a6ee667..db2870a52ce0 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -1420,6 +1420,11 @@ union bpf_attr {
> > >               __aligned_u64   pathname;
> > >               __u32           bpf_fd;
> > >               __u32           file_flags;
> > > +             /* same as dirfd in openat() syscall; see openat(2)
> > > +              * manpage for details of dirfd/path_fd and pathname semantics;
> > > +              * zero path_fd implies AT_FDCWD behavior
> > > +              */
> > > +             __u32           path_fd;
> > >       };
> >
> > So 0 is a valid file descriptor and can trivially be created and made to
> > refer to any file. Is this a conscious decision to have a zero value
> > imply AT_FDCWD and have you done this somewhere else in bpf already?
> > Because that's contrary to how any file descriptor based apis work.
> >
> > How this is usually solved for extensible structs is to have a flag
> > field that raises a flag to indicate that the fd fiel is set and thus 0
> > can be used as a valid value.
> >
> > The way you're doing it right now is very counterintuitive to userspace
> > and pretty much guaranteed to cause subtle bugs.
> 
> Yes, it's a very bpf()-specific convention we've settled on a while
> ago. It allows a cleaner and simpler backwards compatibility story
> without having to introduce new flags every single time. Most of BPF
> UAPI by now dictates that (otherwise valid) FD 0 can't be used to pass
> it to bpf() syscall. Most of the time users will be blissfully unaware
> because libbpf and other BPF libraries are checking for fd == 0 and
> dup()'ing them to avoid ever returning FD 0 to the user.
> 
> tl;dr, a conscious decision consistent with the rest of BPF UAPI. It
> is a bpf() peculiarity, yes.

Adding fsdevel so we're aware of this quirk.

So I'm not sure whether this was ever discussed on fsdevel when you took
the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
invalid value.

If it was discussed then great but if not then I would like to make it
very clear that if in the future you decide to introduce custom
semantics for vfs provided infrastructure - especially when exposed to
userspace - that you please Cc us.

You often make it very clear on the list that you don't like it when
anything that touches bpf code doesn't end up getting sent to the bpf
mailing list. It is exactly the same for us.

This is not a rant I'm really just trying to make sure that we agree on
common ground when it comes to touching each others code or semantic
assumptions.

I personally find this extremely weird to treat fd 0 as anything other
than a random fd number as it goes against any userspace assumptions and
drastically deviates from basically every file descriptor interface we
have. I mean, you're not just saying fd 0 is invalid you're even saying
it means AT_FDCWD.

For every other interface, including those that pass fds in structs
whose extensibility is premised on unknown fields being set to zero,
have ways to make fd 0 work just fine. You could've done that to without
inventing custom fd semantics.