Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"

Daniel Borkmann <daniel@xxxxxxxxxxxxx> · Tue, 14 Jun 2022 22:34:06 +0200

On 6/14/22 4:20 PM, Quentin Monnet wrote:
2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@xxxxxxxxx>
On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@xxxxxxxxxx>
On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
2022-06-10 09:07 UTC-0700 ~ sdf@xxxxxxxxxx
On 06/10, Quentin Monnet wrote:
This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.

In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
kernel has switched to memcg-based memory accounting. Thanks to the
LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility
with other systems and ask libbpf to raise the limit for us if
necessary.

How do we know if memcg-based accounting is supported? There is a probe
in libbpf to check this. But this probe currently relies on the
availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
landed in the same kernel version as the memory accounting change. This
works in the generic case, but it may fail, for example, if the helper
function has been backported to an older kernel. This has been observed
for Google Cloud's Container-Optimized OS (COS), where the helper is
available but rlimit is still in use. The probe succeeds, the rlimit is
not raised, and probing features with bpftool, for example, fails.

A patch was submitted [0] to update this probe in libbpf, based on what
the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
0, attempt to load a BPF object, and reset the rlimit. But it may induce
some hard-to-debug flakiness if another process starts, or the current
application is killed, while the rlimit is reduced, and the approach was
discarded.

As a workaround to ensure that the rlimit bump does not depend on the
availability of a given helper, we restore the unconditional rlimit bump
in bpftool for now.

[0]
https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@xxxxxxxxxxxxx/
[1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39

Cc: Yafang Shao <laoar.shao@xxxxxxxxx>
Signed-off-by: Quentin Monnet <quentin@xxxxxxxxxxxxx>
---
   tools/bpf/bpftool/common.c     | 8 ++++++++
   tools/bpf/bpftool/feature.c    | 2 ++
   tools/bpf/bpftool/main.c       | 6 +++---
   tools/bpf/bpftool/main.h       | 2 ++
   tools/bpf/bpftool/map.c        | 2 ++
   tools/bpf/bpftool/pids.c       | 1 +
   tools/bpf/bpftool/prog.c       | 3 +++
   tools/bpf/bpftool/struct_ops.c | 2 ++
   8 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index a45b42ee8ab0..a0d4acd7c54a 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -17,6 +17,7 @@
   #include <linux/magic.h>
   #include <net/if.h>
   #include <sys/mount.h>
+#include <sys/resource.h>
   #include <sys/stat.h>
   #include <sys/vfs.h>

@@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
       return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
   }

+void set_max_rlimit(void)
+{
+    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
+
+    setrlimit(RLIMIT_MEMLOCK, &rinf);

Do you think it might make sense to print to stderr some warning if
we actually happen to adjust this limit?

if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
     infinity!\n");
     setrlimit(RLIMIT_MEMLOCK, &rinf);
}

?

Because while it's nice that we automatically do this, this might still
lead to surprises for some users. OTOH, not sure whether people
actually read those warnings? :-/

I'm not strictly opposed to a warning, but I'm not completely sure this
is desirable.

Bpftool has raised the rlimit for a long time, it changed only in April,
so I don't think it would come up as a surprise for people who have used
it for a while. I think this is also something that several other
BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
have been doing too.

In this case ignore me and let's continue doing that :-)

Btw, eventually we'd still like to stop doing that I'd presume?

Agreed. I was thinking either finding a way to improve the probe in
libbpf, or waiting for some more time until 5.11 gets old, but this may
take years :/

Should
we at some point follow up with something like:

if (kernel_version >= 5.11) { don't touch memlock; }

?

I guess we care only about <5.11 because of the backports, but 5.11+
kernels are guaranteed to have memcg.

You mean from uname() and parsing the release? Yes I suppose we could do
that, can do as a follow-up.

Yeah, uname-based, I don't think we can do better? Given that probing
is problematic as well :-(
But idk, up to you.

Agreed with the uname-based solution. Another possible solution is to
probe the member 'memcg' in struct bpf_map, in case someone may
backport memcg-based  memory accounting, but that will be a little
over-engineering. The uname-based solution is simple and can work.

Thanks! Yes, memcg would be more complex: the struct is not exposed to
user space, and BTF is not a hard dependency for bpftool. I'll work on
the uname-based test as a follow-up to this set.

How would this work for things like RHEL? Maybe two potential workarounds ...
1) We could use a different helper for the probing and see how far we get with
it.. though not great, probably still rare enough that we would run into it
again. 2) Maybe we could create a temp memcg and check whether we get accounted
against it on prog load (e.g. despite high rlimit)?

Thanks,
Daniel