Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 18 May 2023 14:56:07 -0700

On Wed, May 17, 2023 at 2:11 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> On Tue, May 16, 2023 at 11:02:42AM -0700, Andrii Nakryiko wrote:
> > On Tue, May 16, 2023 at 2:07 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > >
> > > On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote:
> > > > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
> > > > forces users to specify pinning location as a string-based absolute or
> > > > relative (to current working directory) path. This has various
> > > > implications related to security (e.g., symlink-based attacks), forces
> > > > BPF FS to be exposed in the file system, which can cause races with
> > > > other applications.
> > > >
> > > > One of the feedbacks we got from folks working with containers heavily
> > > > was that inability to use purely FD-based location specification was an
> > > > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
> > > > commands. This patch closes this oversight, adding path_fd field to
> > >
> > > Cool!
> > >
> > > > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
> > > > *at() syscalls for dirfd + pathname combinations.
> > > >
> > > > This now allows interesting possibilities like working with detached BPF
> > > > FS mount (e.g., to perform multiple pinnings without running a risk of
> > > > someone interfering with them), and generally making pinning/getting
> > > > more secure and not prone to any races and/or security attacks.
> > > >
> > > > This is demonstrated by a selftest added in subsequent patch that takes
> > > > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
> > > > creating detached BPF FS mount, pinning, and then getting BPF map out of
> > > > it, all while never exposing this private instance of BPF FS to outside
> > > > worlds.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > > ---
> > > >  include/linux/bpf.h            |  4 ++--
> > > >  include/uapi/linux/bpf.h       |  5 +++++
> > > >  kernel/bpf/inode.c             | 16 ++++++++--------
> > > >  kernel/bpf/syscall.c           |  8 +++++---
> > > >  tools/include/uapi/linux/bpf.h |  5 +++++
> > > >  5 files changed, 25 insertions(+), 13 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 36e4b2d8cca2..f58895830ada 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
> > > >  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> > > >  struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
> > > >
> > > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> > > > -int bpf_obj_get_user(const char __user *pathname, int flags);
> > > > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> > > > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
> > > >
> > > >  #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
> > > >  #define DEFINE_BPF_ITER_FUNC(target, args...)                        \
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index 1bb11a6ee667..db2870a52ce0 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -1420,6 +1420,11 @@ union bpf_attr {
> > > >               __aligned_u64   pathname;
> > > >               __u32           bpf_fd;
> > > >               __u32           file_flags;
> > > > +             /* same as dirfd in openat() syscall; see openat(2)
> > > > +              * manpage for details of dirfd/path_fd and pathname semantics;
> > > > +              * zero path_fd implies AT_FDCWD behavior
> > > > +              */
> > > > +             __u32           path_fd;
> > > >       };
> > >
> > > So 0 is a valid file descriptor and can trivially be created and made to
> > > refer to any file. Is this a conscious decision to have a zero value
> > > imply AT_FDCWD and have you done this somewhere else in bpf already?
> > > Because that's contrary to how any file descriptor based apis work.
> > >
> > > How this is usually solved for extensible structs is to have a flag
> > > field that raises a flag to indicate that the fd fiel is set and thus 0
> > > can be used as a valid value.
> > >
> > > The way you're doing it right now is very counterintuitive to userspace
> > > and pretty much guaranteed to cause subtle bugs.
> >
> > Yes, it's a very bpf()-specific convention we've settled on a while
> > ago. It allows a cleaner and simpler backwards compatibility story
> > without having to introduce new flags every single time. Most of BPF
> > UAPI by now dictates that (otherwise valid) FD 0 can't be used to pass
> > it to bpf() syscall. Most of the time users will be blissfully unaware
> > because libbpf and other BPF libraries are checking for fd == 0 and
> > dup()'ing them to avoid ever returning FD 0 to the user.
> >
> > tl;dr, a conscious decision consistent with the rest of BPF UAPI. It
> > is a bpf() peculiarity, yes.
>
> Adding fsdevel so we're aware of this quirk.
>
> So I'm not sure whether this was ever discussed on fsdevel when you took
> the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> invalid value.
>
> If it was discussed then great but if not then I would like to make it
> very clear that if in the future you decide to introduce custom
> semantics for vfs provided infrastructure - especially when exposed to
> userspace - that you please Cc us.

Yep, I'll remember to cc linux-fsdevel@xxxxxxxxxxxxxxx for future
patches touching on vfs-related concepts, no problem. I wasn't trying
to sneak it in or anything, it just never occurred to me, sorry about
that.

>
> You often make it very clear on the list that you don't like it when
> anything that touches bpf code doesn't end up getting sent to the bpf
> mailing list. It is exactly the same for us.

That's a fair request, ack.

>
> This is not a rant I'm really just trying to make sure that we agree on
> common ground when it comes to touching each others code or semantic
> assumptions.
>
> I personally find this extremely weird to treat fd 0 as anything other
> than a random fd number as it goes against any userspace assumptions and
> drastically deviates from basically every file descriptor interface we
> have. I mean, you're not just saying fd 0 is invalid you're even saying
> it means AT_FDCWD.

Agreed, I can see how this could have undesirable (not just
surprising) implications if, say, open(O_PATH) returned fd=0. I just
sent a v2 with a new flag that needs to be specified to be able to use
path FD (and falling back to backwards-compatible AT_FDCWD behavior
otherwise).

>
> For every other interface, including those that pass fds in structs
> whose extensibility is premised on unknown fields being set to zero,
> have ways to make fd 0 work just fine. You could've done that to without
> inventing custom fd semantics.