2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@xxxxxxxxxx>
> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
>>
>> 2022-06-10 09:07 UTC-0700 ~ sdf@xxxxxxxxxx
>>> On 06/10, Quentin Monnet wrote:
>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.
>>>
>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
>>>> kernel has switched to memcg-based memory accounting. Thanks to the
>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK option, we attempted to keep
>>>> compatibility with other systems by asking libbpf to raise the limit
>>>> for us if necessary.
>>>
>>>> How do we know if memcg-based accounting is supported? There is a probe
>>>> in libbpf to check this. But this probe currently relies on the
>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
>>>> landed in the same kernel version as the memory accounting change. This
>>>> works in the generic case, but it may fail, for example, if the helper
>>>> function has been backported to an older kernel. This has been observed
>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is
>>>> available but rlimit is still in use. The probe succeeds, the rlimit is
>>>> not raised, and probing features with bpftool, for example, fails.
>>>
>>>> A patch was submitted [0] to update this probe in libbpf, based on what
>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may
>>>> induce some hard-to-debug flakiness if another process starts, or the
>>>> current application is killed, while the rlimit is reduced, so the
>>>> approach was discarded.
>>>
>>>> As a workaround to ensure that the rlimit bump does not depend on the
>>>> availability of a given helper, we restore the unconditional rlimit
>>>> bump in bpftool for now.
>>>
>>>> [0] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@xxxxxxxxxxxxx/
>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
>>>
>>>> Cc: Yafang Shao <laoar.shao@xxxxxxxxx>
>>>> Signed-off-by: Quentin Monnet <quentin@xxxxxxxxxxxxx>
>>>> ---
>>>>  tools/bpf/bpftool/common.c     | 8 ++++++++
>>>>  tools/bpf/bpftool/feature.c    | 2 ++
>>>>  tools/bpf/bpftool/main.c       | 6 +++---
>>>>  tools/bpf/bpftool/main.h       | 2 ++
>>>>  tools/bpf/bpftool/map.c        | 2 ++
>>>>  tools/bpf/bpftool/pids.c       | 1 +
>>>>  tools/bpf/bpftool/prog.c       | 3 +++
>>>>  tools/bpf/bpftool/struct_ops.c | 2 ++
>>>>  8 files changed, 23 insertions(+), 3 deletions(-)
>>>
>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
>>>> index a45b42ee8ab0..a0d4acd7c54a 100644
>>>> --- a/tools/bpf/bpftool/common.c
>>>> +++ b/tools/bpf/bpftool/common.c
>>>> @@ -17,6 +17,7 @@
>>>>  #include <linux/magic.h>
>>>>  #include <net/if.h>
>>>>  #include <sys/mount.h>
>>>> +#include <sys/resource.h>
>>>>  #include <sys/stat.h>
>>>>  #include <sys/vfs.h>
>>>
>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
>>>>  	return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
>>>>  }
>>>
>>>> +void set_max_rlimit(void)
>>>> +{
>>>> +	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>>>> +
>>>> +	setrlimit(RLIMIT_MEMLOCK, &rinf);
>>>
>>> Do you think it might make sense to print to stderr some warning if
>>> we actually happen to adjust this limit?
>>>
>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
>>> 	fprintf(stderr, "Warning: resetting MEMLOCK rlimit to infinity!\n");
>>> 	setrlimit(RLIMIT_MEMLOCK, &rinf);
>>> }
>>>
>>> ?
>>>
>>> Because while it's nice that we automatically do this, this might still
>>> lead to surprises for some users. OTOH, not sure whether people
>>> actually read those warnings? :-/
>>
>> I'm not strictly opposed to a warning, but I'm not completely sure this
>> is desirable.
>>
>> Bpftool has raised the rlimit for a long time, it changed only in April,
>> so I don't think it would come up as a surprise for people who have used
>> it for a while. I think this is also something that several other
>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
>> have been doing too.
>
> In this case ignore me and let's continue doing that :-)
>
> Btw, eventually we'd still like to stop doing that I'd presume?

Agreed. I was thinking either finding a way to improve the probe in
libbpf, or waiting for some more time until 5.11 gets old, but this may
take years :/

> Should we at some point follow up with something like:
>
> if (kernel_version >= 5.11) { don't touch memlock; }
>
> ?
>
> I guess we care only about <5.11 because of the backports, but 5.11+
> kernels are guaranteed to have memcg.

You mean from uname() and parsing the release? Yes I suppose we could
do that, can do as a follow-up.

> I'm not sure whether memlock is used out there in the distros (and
> especially for root/bpf_capable), so I'm also not sure whether we
> really care or not.

Not sure either. For what it's worth, I've never seen complaints so far
from users about the rlimit being raised (from bpftool or other BPF
apps).
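
For reference, a minimal sketch of the "warn only when we actually have to
raise the limit" variant floated above, written out with getrlimit() as it is
actually defined (it fills a struct rlimit and returns 0 on success, rather
than returning the limit). The thread settled on keeping the silent bump, so
this is illustrative only; the function name is made up and none of this is
part of the patch:

/* Sketch only: warn when bpftool actually has to raise RLIMIT_MEMLOCK.
 * Hypothetical variant, not what the patch above does.
 */
#include <stdio.h>
#include <sys/resource.h>

void set_max_rlimit_verbose(void)
{
	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
	struct rlimit cur;

	/* getrlimit() returns 0 on success and fills "cur" */
	if (getrlimit(RLIMIT_MEMLOCK, &cur))
		return;

	if (cur.rlim_cur != RLIM_INFINITY || cur.rlim_max != RLIM_INFINITY) {
		fprintf(stderr,
			"Warning: raising RLIMIT_MEMLOCK to infinity\n");
		setrlimit(RLIMIT_MEMLOCK, &rinf);
	}
}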
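
And a minimal sketch of the uname()-based follow-up discussed at the end of
the thread: skip the bump on 5.11+ kernels, which are guaranteed to use
memcg-based accounting, and keep bumping unconditionally on anything older,
where raising the limit is harmless even if memcg accounting was backported.
The kernel_has_memcg_accounting() helper and the release parsing are
assumptions for illustration, not code from the patch or from libbpf:

/* Sketch only: conditional rlimit bump based on the running kernel release. */
#include <stdbool.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/utsname.h>

static bool kernel_has_memcg_accounting(void)
{
	struct utsname uts;
	unsigned int major, minor;

	if (uname(&uts) || sscanf(uts.release, "%u.%u", &major, &minor) != 2)
		return false;	/* cannot tell: fall back to bumping the limit */

	/* memcg-based accounting for BPF memory landed in 5.11 */
	return major > 5 || (major == 5 && minor >= 11);
}

void set_max_rlimit(void)
{
	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };

	if (kernel_has_memcg_accounting())
		return;

	setrlimit(RLIMIT_MEMLOCK, &rinf);
}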