On Tue, Jun 14, 2022 at 1:34 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > > On 6/14/22 4:20 PM, Quentin Monnet wrote: > > 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@xxxxxxxxx> > >> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote: > >>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote: > >>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@xxxxxxxxxx> > >>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote: > >>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@xxxxxxxxxx > >>>>>>> On 06/10, Quentin Monnet wrote: > >>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > >>>>>>> > >>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > >>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > >>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the > >>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > >>>>>>>> with other systems and ask libbpf to raise the limit for us if > >>>>>>>> necessary. > >>>>>>> > >>>>>>>> How do we know if memcg-based accounting is supported? There is a probe > >>>>>>>> in libbpf to check this. But this probe currently relies on the > >>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > >>>>>>>> landed in the same kernel version as the memory accounting change. This > >>>>>>>> works in the generic case, but it may fail, for example, if the helper > >>>>>>>> function has been backported to an older kernel. This has been observed > >>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is > >>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is > >>>>>>>> not raised, and probing features with bpftool, for example, fails. > >>>>>>> > >>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what > >>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > >>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > >>>>>>>> some hard-to-debug flakiness if another process starts, or the current > >>>>>>>> application is killed, while the rlimit is reduced, and the approach was > >>>>>>>> discarded. > >>>>>>> > >>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the > >>>>>>>> availability of a given helper, we restore the unconditional rlimit bump > >>>>>>>> in bpftool for now. > >>>>>>> > >>>>>>>> [0] > >>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@xxxxxxxxxxxxx/ > >>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > >>>>>>> > >>>>>>>> Cc: Yafang Shao <laoar.shao@xxxxxxxxx> > >>>>>>>> Signed-off-by: Quentin Monnet <quentin@xxxxxxxxxxxxx> > >>>>>>>> --- > >>>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ > >>>>>>>> tools/bpf/bpftool/feature.c | 2 ++ > >>>>>>>> tools/bpf/bpftool/main.c | 6 +++--- > >>>>>>>> tools/bpf/bpftool/main.h | 2 ++ > >>>>>>>> tools/bpf/bpftool/map.c | 2 ++ > >>>>>>>> tools/bpf/bpftool/pids.c | 1 + > >>>>>>>> tools/bpf/bpftool/prog.c | 3 +++ > >>>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ > >>>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) > >>>>>>> > >>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > >>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 > >>>>>>>> --- a/tools/bpf/bpftool/common.c > >>>>>>>> +++ b/tools/bpf/bpftool/common.c > >>>>>>>> @@ -17,6 +17,7 @@ > >>>>>>>> #include <linux/magic.h> > >>>>>>>> #include <net/if.h> > >>>>>>>> #include <sys/mount.h> > >>>>>>>> +#include <sys/resource.h> > >>>>>>>> #include <sys/stat.h> > >>>>>>>> #include <sys/vfs.h> > >>>>>>> > >>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > >>>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > >>>>>>>> } > >>>>>>> > >>>>>>>> +void set_max_rlimit(void) > >>>>>>>> +{ > >>>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > >>>>>>>> + > >>>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); > >>>>>>> > >>>>>>> Do you think it might make sense to print to stderr some warning if > >>>>>>> we actually happen to adjust this limit? > >>>>>>> > >>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > >>>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > >>>>>>> infinity!\n"); > >>>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); > >>>>>>> } > >>>>>>> > >>>>>>> ? > >>>>>>> > >>>>>>> Because while it's nice that we automatically do this, this might still > >>>>>>> lead to surprises for some users. OTOH, not sure whether people > >>>>>>> actually read those warnings? :-/ > >>>>>> > >>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this > >>>>>> is desirable. > >>>>>> > >>>>>> Bpftool has raised the rlimit for a long time, it changed only in April, > >>>>>> so I don't think it would come up as a surprise for people who have used > >>>>>> it for a while. I think this is also something that several other > >>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > >>>>>> have been doing too. > >>>>> > >>>>> In this case ignore me and let's continue doing that :-) > >>>>> > >>>>> Btw, eventually we'd still like to stop doing that I'd presume? > >>>> > >>>> Agreed. I was thinking either finding a way to improve the probe in > >>>> libbpf, or waiting for some more time until 5.11 gets old, but this may > >>>> take years :/ > >>>> > >>>>> Should > >>>>> we at some point follow up with something like: > >>>>> > >>>>> if (kernel_version >= 5.11) { don't touch memlock; } > >>>>> > >>>>> ? > >>>>> > >>>>> I guess we care only about <5.11 because of the backports, but 5.11+ > >>>>> kernels are guaranteed to have memcg. > >>>> > >>>> You mean from uname() and parsing the release? Yes I suppose we could do > >>>> that, can do as a follow-up. > >>> > >>> Yeah, uname-based, I don't think we can do better? Given that probing > >>> is problematic as well :-( > >>> But idk, up to you. > >> > >> Agreed with the uname-based solution. Another possible solution is to > >> probe the member 'memcg' in struct bpf_map, in case someone may > >> backport memcg-based memory accounting, but that will be a little > >> over-engineering. The uname-based solution is simple and can work. > > > > Thanks! Yes, memcg would be more complex: the struct is not exposed to > > user space, and BTF is not a hard dependency for bpftool. I'll work on > > the uname-based test as a follow-up to this set. > > How would this work for things like RHEL? Maybe two potential workarounds ... > 1) We could use a different helper for the probing and see how far we get with > it.. though not great, probably still rare enough that we would run into it > again. 2) Maybe we could create a temp memcg and check whether we get accounted > against it on prog load (e.g. despite high rlimit)? This might be dangerous as well: we don't want to move to a different cgroup and move back; if we are in a multithreaded application, other threads might potentially create resources in that temp directory. And I don't think we can fork as well since we are in a library :-(