Re: [PATCH v2 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Sun, 12 Dec 2021 11:45:32 -0800

On Sat, Dec 11, 2021 at 6:23 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Sat, Dec 11, 2021 at 11:36 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >
> > Andrii Nakryiko <andrii@xxxxxxxxxx> writes:
> >
> > > The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
> > > one of the first extremely frustrating gotchas that all new BPF users go
> > > through and in some cases have to learn it a very hard way.
> > >
> > > Luckily, starting with upstream Linux kernel version 5.11, BPF subsystem
> > > dropped the dependency on memlock and uses memcg-based memory accounting
> > > instead. Unfortunately, detecting memcg-based BPF memory accounting is
> > > far from trivial (as can be evidenced by this patch), so in practice
> > > most BPF applications still do unconditional RLIMIT_MEMLOCK increase.
> > >
> > > As we move towards libbpf 1.0, it would be good to allow users to forget
> > > about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment
> > > automatically. This patch paves the way forward in this matter. Libbpf
> > > will do feature detection of memcg-based accounting, and if detected,
> > > will do nothing. But if the kernel is too old, just like BCC, libbpf
> > > will automatically increase RLIMIT_MEMLOCK on behalf of user
> > > application ([0]).
> > >
> > > As this is technically a breaking change, during the transition period
> > > applications have to opt into libbpf 1.0 mode by setting
> > > LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
> > > libbpf_set_strict_mode().
> > >
> > > Libbpf allows to control the exact amount of set RLIMIT_MEMLOCK limit
> > > with libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf do
> > > nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to be
> > > called before the first bpf_prog_load(), bpf_btf_load(), or
> > > bpf_object__load() call, otherwise it has no effect and will return
> > > -EBUSY.
> > >
> > >   [0] Closes: https://github.com/libbpf/libbpf/issues/369
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> >
> > The probing approach breaks with out-of-order backports, I suppose.
> > Hopefully no one will do those for that particular patch, though (it's
> > not really a bugfix), and at least for RHEL we did backport them
> > together.
> >
> > Can't think of any better ways of doing the detection either, but maybe
> > something to be aware of in the future (i.e., "don't change things in a
> > way that can't be detected from userspace")?

Initial version of memcg changes was supposed to return 0 memlock
value, when queried for BPF map. But it was decided to output some
approximation instead, so that one convenient way to detect this was
removed. I agree that kernel changes should show more care for
user-space detection in cases like this.

> >
> > Anyway, with the nits below:
> >
> > Acked-by: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
> >
> > > diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> > > index 6b2407e12060..7c82136979bf 100644
> > > --- a/tools/lib/bpf/bpf.c
> > > +++ b/tools/lib/bpf/bpf.c
> >
> > [...]
> >
> > > +     /* attempt loading freplace trying to use custom BTF */
> > > +     memset(&attr, 0, bpf_load_attr_sz);
> > > +     attr.prog_type = BPF_PROG_TYPE_TRACING;
> > > +     attr.expected_attach_type = BPF_TRACE_FENTRY;
> >
> > This comment also seems to be disagreeing with the code it's commenting
> > on?
> >

That's what I get for writing comments. Will fix it up, thanks for spotting.

> > [...]
> >
> > > +static bool memlock_bumped;
> > > +static rlim_t memlock_rlim_max = RLIM_INFINITY;
> > > +
> > > +int libbpf_set_memlock_rlim_max(size_t memlock_max)
> > > +{
> > > +     if (memlock_bumped)
> > > +             return libbpf_err(-EBUSY);
> > > +
> > > +     memlock_rlim_max = memlock_max;
> > > +     return 0;
> > > +}
> > > +
> > > +int bump_rlimit_memlock(void)
> > > +{
> > > +     struct rlimit rlim;
> > > +
> > > +     /* this the default in libbpf 1.0, but for now user has to opt-in explicitly */
> > > +     if (!(libbpf_mode & LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK))
> > > +             return 0;
> > > +
> > > +     /* if kernel supports memcg-based accounting, skip bumping RLIMIT_MEMLOCK */
> > > +     if (memlock_bumped || kernel_supports(NULL, FEAT_MEMCG_ACCOUNT))
> > > +             return 0;
> > > +
> > > +     memlock_bumped = true;
> > > +
> > > +     /* zero memlock_rlim_max disables auto-bumping RLIMIT_MEMLOCK */
> > > +     if (memlock_rlim_max == 0)
> > > +             return 0;
> > > +
> > > +     rlim.rlim_cur = rlim.rlim_max = memlock_rlim_max;
> > > +     if (setrlimit(RLIMIT_MEMLOCK, &rlim))
> > > +             return -errno;
> > > +
> > > +     return 0;
> > > +}
> > > +
> >
> > "rlim_max" seems to imply this will only ever increase the limit, but if
> > I'm reading the code correctly it could actually end up lowering the
> > effective limit?
>
> _max suffix is confusing to me as well.
> libbpf_set_memlock_rlim() would be more accurate.

The thinking was that we bump rlim_max value with setrlimit(), but
yeah, I'm totally fine naming it as libbpf_set_memlock_rlim(), I'll
update in the next revision.

>
> The other part of the patch:
> +               /* assume memcg accounting is supported if we get -ENOTSUPP */
> +               err = errno == 524 /* ENOTSUPP */ ? 1 : 0;
>
> is dangerous.
> ENOTSUPP is not a standard error code.
> The kernel returns it from a few places, but we might fix
> all of them eventually and return EOPNOTSUPP everywhere.
> In other words user space cannot rely on this 524 error code number.

I can check both ENOTSUPP and EOPNOTSUPP, if the error code is the
concern. Worst case, even if this check breaks, we'll just do
setrlimit() unnecessarily, which the user can prevent with
libbpf_set_memlock_rlim(0), if they know that all supported systems
they expect to run on have memcg accounting for BPF. If not, they
should be fine with setrlimit() anyways (or prevent it explicitly, for
libbpf 1.0 mode).