2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@xxxxxxxxxx>
> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
>>
>> 2022-06-10 09:07 UTC-0700 ~ sdf@xxxxxxxxxx
>>> On 06/10, Quentin Monnet wrote:
>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.
>>>
>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
>>>> kernel has switched to memcg-based memory accounting. Thanks to the
>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK option, we attempted to keep
>>>> compatibility with other systems by asking libbpf to raise the limit
>>>> for us if necessary.
>>>
>>>> How do we know if memcg-based accounting is supported? There is a probe
>>>> in libbpf to check this. But this probe currently relies on the
>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
>>>> landed in the same kernel version as the memory accounting change. This
>>>> works in the generic case, but it may fail, for example, if the helper
>>>> function has been backported to an older kernel. This has been observed
>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is
>>>> available but rlimit is still in use. The probe succeeds, the rlimit is
>>>> not raised, and probing features with bpftool, for example, fails.
>>>
>>>> A patch was submitted [0] to update this probe in libbpf, based on what
>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may
>>>> induce some hard-to-debug flakiness if another process starts, or the
>>>> current application is killed, while the rlimit is reduced, so the
>>>> approach was discarded.
>>>
>>>> As a workaround to ensure that the rlimit bump does not depend on the
>>>> availability of a given helper, we restore the unconditional rlimit
>>>> bump in bpftool for now.
>>>
>>>> [0] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@xxxxxxxxxxxxx/
>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
>>>
>>>> Cc: Yafang Shao <laoar.shao@xxxxxxxxx>
>>>> Signed-off-by: Quentin Monnet <quentin@xxxxxxxxxxxxx>
>>>> ---
>>>>  tools/bpf/bpftool/common.c     | 8 ++++++++
>>>>  tools/bpf/bpftool/feature.c    | 2 ++
>>>>  tools/bpf/bpftool/main.c       | 6 +++---
>>>>  tools/bpf/bpftool/main.h       | 2 ++
>>>>  tools/bpf/bpftool/map.c        | 2 ++
>>>>  tools/bpf/bpftool/pids.c       | 1 +
>>>>  tools/bpf/bpftool/prog.c       | 3 +++
>>>>  tools/bpf/bpftool/struct_ops.c | 2 ++
>>>>  8 files changed, 23 insertions(+), 3 deletions(-)
>>>
>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
>>>> index a45b42ee8ab0..a0d4acd7c54a 100644
>>>> --- a/tools/bpf/bpftool/common.c
>>>> +++ b/tools/bpf/bpftool/common.c
>>>> @@ -17,6 +17,7 @@
>>>>  #include <linux/magic.h>
>>>>  #include <net/if.h>
>>>>  #include <sys/mount.h>
>>>> +#include <sys/resource.h>
>>>>  #include <sys/stat.h>
>>>>  #include <sys/vfs.h>
>>>
>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
>>>>  	return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
>>>>  }
>>>
>>>> +void set_max_rlimit(void)
>>>> +{
>>>> +	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>>>> +
>>>> +	setrlimit(RLIMIT_MEMLOCK, &rinf);
>>>
>>> Do you think it might make sense to print to stderr some warning if
>>> we actually happen to adjust this limit?
>>>
>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
>>> 	fprintf(stderr, "Warning: resetting MEMLOCK rlimit to infinity!\n");
>>> 	setrlimit(RLIMIT_MEMLOCK, &rinf);
>>> }
>>>
>>> ?
>>>
>>> Because while it's nice that we automatically do this, this might still
>>> lead to surprises for some users. OTOH, not sure whether people
>>> actually read those warnings? :-/
>>
>> I'm not strictly opposed to a warning, but I'm not completely sure this
>> is desirable.
>>
>> Bpftool has raised the rlimit for a long time, it changed only in April,
>> so I don't think it would come up as a surprise for people who have used
>> it for a while. I think this is also something that several other
>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
>> have been doing too.
>
> In this case ignore me and let's continue doing that :-)
>
> Btw, eventually we'd still like to stop doing that I'd presume?

Agreed. I was thinking either finding a way to improve the probe in
libbpf, or waiting for some more time until 5.11 gets old, but this may
take years :/

> Should we at some point follow up with something like:
>
> if (kernel_version >= 5.11) { don't touch memlock; }
>
> ?
>
> I guess we care only about <5.11 because of the backports, but 5.11+
> kernels are guaranteed to have memcg.

You mean from uname() and parsing the release? Yes I suppose we could
do that, can do as a follow-up.

> I'm not sure whether memlock is used out there in the distros (and
> especially for root/bpf_capable), so I'm also not sure whether we
> really care or not.

Not sure either. For what it's worth, I've never seen complaints so far
from users about the rlimit being raised (from bpftool or other BPF
apps).
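
For reference, a minimal sketch of the "warn only when we actually have to
raise the limit" variant floated above, written out with getrlimit() as it is
actually defined (it fills a struct rlimit and returns 0 on success, rather
than returning the limit). The thread settled on keeping the silent bump, so
this is illustrative only; the function name is made up and none of this is
part of the patch:

/* Sketch only: warn when bpftool actually has to raise RLIMIT_MEMLOCK.
 * Hypothetical variant, not what the patch above does.
 */
#include <stdio.h>
#include <sys/resource.h>

void set_max_rlimit_verbose(void)
{
	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
	struct rlimit cur;

	/* getrlimit() returns 0 on success and fills "cur" */
	if (getrlimit(RLIMIT_MEMLOCK, &cur))
		return;

	if (cur.rlim_cur != RLIM_INFINITY || cur.rlim_max != RLIM_INFINITY) {
		fprintf(stderr,
			"Warning: raising RLIMIT_MEMLOCK to infinity\n");
		setrlimit(RLIMIT_MEMLOCK, &rinf);
	}
}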
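
And a minimal sketch of the uname()-based follow-up discussed at the end of
the thread: skip the bump on 5.11+ kernels, which are guaranteed to use
memcg-based accounting, and keep bumping unconditionally on anything older,
where raising the limit is harmless even if memcg accounting was backported.
The kernel_has_memcg_accounting() helper and the release parsing are
assumptions for illustration, not code from the patch or from libbpf:

/* Sketch only: conditional rlimit bump based on the running kernel release. */
#include <stdbool.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/utsname.h>

static bool kernel_has_memcg_accounting(void)
{
	struct utsname uts;
	unsigned int major, minor;

	if (uname(&uts) || sscanf(uts.release, "%u.%u", &major, &minor) != 2)
		return false;	/* cannot tell: fall back to bumping the limit */

	/* memcg-based accounting for BPF memory landed in 5.11 */
	return major > 5 || (major == 5 && minor >= 11);
}

void set_max_rlimit(void)
{
	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };

	if (kernel_has_memcg_accounting())
		return;

	setrlimit(RLIMIT_MEMLOCK, &rinf);
}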