Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"

Yafang Shao <laoar.shao@xxxxxxxxx> · Wed, 15 Jun 2022 21:22:23 +0800

On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
>
> 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@xxxxxxxxx>
> > On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >>
> >> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
> >>>
> >>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@xxxxxxxxxx>
> >>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@xxxxxxxxxx
> >>>>>> On 06/10, Quentin Monnet wrote:
> >>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.
> >>>>>>
> >>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
> >>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
> >>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the
> >>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility
> >>>>>>> with other systems and ask libbpf to raise the limit for us if
> >>>>>>> necessary.
> >>>>>>
> >>>>>>> How do we know if memcg-based accounting is supported? There is a probe
> >>>>>>> in libbpf to check this. But this probe currently relies on the
> >>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
> >>>>>>> landed in the same kernel version as the memory accounting change. This
> >>>>>>> works in the generic case, but it may fail, for example, if the helper
> >>>>>>> function has been backported to an older kernel. This has been observed
> >>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is
> >>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is
> >>>>>>> not raised, and probing features with bpftool, for example, fails.
> >>>>>>
> >>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what
> >>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
> >>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce
> >>>>>>> some hard-to-debug flakiness if another process starts, or the current
> >>>>>>> application is killed, while the rlimit is reduced, and the approach was
> >>>>>>> discarded.
> >>>>>>
> >>>>>>> As a workaround to ensure that the rlimit bump does not depend on the
> >>>>>>> availability of a given helper, we restore the unconditional rlimit bump
> >>>>>>> in bpftool for now.
> >>>>>>
> >>>>>>> [0]
> >>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@xxxxxxxxxxxxx/
> >>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
> >>>>>>
> >>>>>>> Cc: Yafang Shao <laoar.shao@xxxxxxxxx>
> >>>>>>> Signed-off-by: Quentin Monnet <quentin@xxxxxxxxxxxxx>
> >>>>>>> ---
> >>>>>>>   tools/bpf/bpftool/common.c     | 8 ++++++++
> >>>>>>>   tools/bpf/bpftool/feature.c    | 2 ++
> >>>>>>>   tools/bpf/bpftool/main.c       | 6 +++---
> >>>>>>>   tools/bpf/bpftool/main.h       | 2 ++
> >>>>>>>   tools/bpf/bpftool/map.c        | 2 ++
> >>>>>>>   tools/bpf/bpftool/pids.c       | 1 +
> >>>>>>>   tools/bpf/bpftool/prog.c       | 3 +++
> >>>>>>>   tools/bpf/bpftool/struct_ops.c | 2 ++
> >>>>>>>   8 files changed, 23 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> >>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644
> >>>>>>> --- a/tools/bpf/bpftool/common.c
> >>>>>>> +++ b/tools/bpf/bpftool/common.c
> >>>>>>> @@ -17,6 +17,7 @@
> >>>>>>>   #include <linux/magic.h>
> >>>>>>>   #include <net/if.h>
> >>>>>>>   #include <sys/mount.h>
> >>>>>>> +#include <sys/resource.h>
> >>>>>>>   #include <sys/stat.h>
> >>>>>>>   #include <sys/vfs.h>
> >>>>>>
> >>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
> >>>>>>>       return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
> >>>>>>>   }
> >>>>>>
> >>>>>>> +void set_max_rlimit(void)
> >>>>>>> +{
> >>>>>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> >>>>>>> +
> >>>>>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>>>>>
> >>>>>> Do you think it might make sense to print to stderr some warning if
> >>>>>> we actually happen to adjust this limit?
> >>>>>>
> >>>>>> struct rlimit cur;
> >>>>>>
> >>>>>> if (!getrlimit(RLIMIT_MEMLOCK, &cur) && cur.rlim_cur != RLIM_INFINITY) {
> >>>>>>     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to infinity!\n");
> >>>>>>     setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>>>>> }
> >>>>>>
> >>>>>> ?
> >>>>>>
> >>>>>> Because while it's nice that we automatically do this, this might still
> >>>>>> lead to surprises for some users. OTOH, not sure whether people
> >>>>>> actually read those warnings? :-/
> >>>>>
> >>>>> I'm not strictly opposed to a warning, but I'm not completely sure this
> >>>>> is desirable.
> >>>>>
> >>>>> Bpftool has raised the rlimit for a long time, it changed only in April,
> >>>>> so I don't think it would come up as a surprise for people who have used
> >>>>> it for a while. I think this is also something that several other
> >>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
> >>>>> have been doing too.
> >>>>
> >>>> In this case ignore me and let's continue doing that :-)
> >>>>
> >>>> Btw, eventually we'd still like to stop doing that I'd presume?
> >>>
> >>> Agreed. I was thinking either finding a way to improve the probe in
> >>> libbpf, or waiting for some more time until 5.11 gets old, but this may
> >>> take years :/
> >>>
> >>>> Should
> >>>> we at some point follow up with something like:
> >>>>
> >>>> if (kernel_version >= 5.11) { don't touch memlock; }
> >>>>
> >>>> ?
> >>>>
> >>>> I guess we care only about <5.11 because of the backports, but 5.11+
> >>>> kernels are guaranteed to have memcg.
> >>>
> >>> You mean from uname() and parsing the release? Yes I suppose we could do
> >>> that, can do as a follow-up.
> >>
> >> Yeah, uname-based, I don't think we can do better? Given that probing
> >> is problematic as well :-(
> >> But idk, up to you.
> >>
> >
> > Agreed with the uname-based solution. Another possible solution is to
> > probe the member 'memcg' in struct bpf_map, in case someone may
> > backport memcg-based memory accounting, but that will be a little
> > over-engineering. The uname-based solution is simple and can work.
> >
>
> Thanks! Yes, memcg would be more complex: the struct is not exposed to
> user space, and BTF is not a hard dependency for bpftool. I'll work on
> the uname-based test as a follow-up to this set.
>

On second thought, the uname-based test may not work, because
CONFIG_MEMCG_KMEM can be disabled even on 5.11+ kernels.
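
For reference, the uname-based check discussed earlier in the thread
could look roughly like the sketch below. The helper names
release_at_least() and running_kernel_at_least() are hypothetical, not
existing bpftool or libbpf functions. The sketch only compares version
numbers, so it cannot tell whether CONFIG_MEMCG_KMEM is enabled, and it
says nothing about backports.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: true if a release string such as
 * "5.15.0-generic" denotes a version >= major.minor.
 * Strings that do not start with "<int>.<int>" yield false. */
static bool release_at_least(const char *release, int major, int minor)
{
	int maj = 0, min = 0;

	if (sscanf(release, "%d.%d", &maj, &min) != 2)
		return false;
	return maj > major || (maj == major && min >= minor);
}

/* Hypothetical helper: apply the comparison to the running kernel,
 * using the release field reported by uname(2). */
static bool running_kernel_at_least(int major, int minor)
{
	struct utsname uts;

	if (uname(&uts))
		return false;
	return release_at_least(uts.release, major, minor);
}
```

A caller would skip the rlimit bump only when
running_kernel_at_least(5, 11) returns true, and keep bumping it in
every other case, including when the release string cannot be parsed.
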
Maybe we can instead probe for the 'memcg' member of struct bpf_map by
parsing /sys/kernel/btf/vmlinux:
[8584] STRUCT 'bpf_map' size=256 vlen=27
        'ops' type_id=8659 bits_offset=0
        'inner_map_meta' type_id=8587 bits_offset=64
        'security' type_id=93 bits_offset=128
        'map_type' type_id=8532 bits_offset=192
        'key_size' type_id=36 bits_offset=224
        'value_size' type_id=36 bits_offset=256
        'max_entries' type_id=36 bits_offset=288
        'map_extra' type_id=38 bits_offset=320
        'map_flags' type_id=36 bits_offset=384
        'spin_lock_off' type_id=21 bits_offset=416
        'timer_off' type_id=21 bits_offset=448
        'id' type_id=36 bits_offset=480
        'numa_node' type_id=21 bits_offset=512
        'btf_key_type_id' type_id=36 bits_offset=544
        'btf_value_type_id' type_id=36 bits_offset=576
        'btf_vmlinux_value_type_id' type_id=36 bits_offset=608
        'btf' type_id=8660 bits_offset=640
        'memcg' type_id=687 bits_offset=704                       <<<< here
        'name' type_id=337 bits_offset=768
        'bypass_spec_v1' type_id=63 bits_offset=896
        'frozen' type_id=63 bits_offset=904
        'refcnt' type_id=81 bits_offset=1024
        'usercnt' type_id=81 bits_offset=1088
        'work' type_id=484 bits_offset=1152
        'freeze_mutex' type_id=443 bits_offset=1408
        'writecnt' type_id=81 bits_offset=1664
        'owner' type_id=8658 bits_offset=1728

If 'memcg' exists, memory accounting is memcg-based; otherwise it is
rlimit-based.
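
As a rough illustration of that check, here is a sketch that scans a
textual BTF dump in the format shown above (for example the output of
"bpftool btf dump file /sys/kernel/btf/vmlinux") for a given struct
member. The helper name dump_struct_has_member() is hypothetical. A real
implementation would more likely walk the BTF through libbpf than parse
text, and when the kernel exposes no BTF this probe gives no answer
either way.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan a textual BTF dump and report whether
 * STRUCT 'struct_name' has a member called 'member'.
 * Type entries start with "[id]" in column 0; member lines are
 * indented and look like: 'memcg' type_id=687 bits_offset=704 */
static bool dump_struct_has_member(FILE *dump, const char *struct_name,
				   const char *member)
{
	char line[512], hdr[128], field[128];
	bool in_struct = false;

	snprintf(hdr, sizeof(hdr), "] STRUCT '%s' ", struct_name);
	snprintf(field, sizeof(field), "'%s' type_id=", member);

	while (fgets(line, sizeof(line), dump)) {
		const char *p;

		if (line[0] == '[') {
			/* New type entry: are we entering the target struct? */
			in_struct = strstr(line, hdr) != NULL;
			continue;
		}
		if (!in_struct)
			continue;

		/* Skip indentation, then compare the member name. */
		p = line + strspn(line, " \t");
		if (strncmp(p, field, strlen(field)) == 0)
			return true;
	}
	return false;
}
```

With the dump above as input, dump_struct_has_member(f, "bpf_map",
"memcg") would report memcg-based accounting, and a false result would
mean falling back to the rlimit bump.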

WDYT?

-- 
Regards
Yafang