On Thu, Jun 16, 2022 at 9:59 PM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote: > > 2022-06-16 00:05 UTC+0800 ~ Yafang Shao <laoar.shao@xxxxxxxxx> > > On Wed, Jun 15, 2022 at 11:52 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote: > >> > >> On Wed, Jun 15, 2022 at 6:23 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > >>> > >>> On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote: > >>>> > >>>> 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@xxxxxxxxx> > >>>>> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote: > >>>>>> > >>>>>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote: > >>>>>>> > >>>>>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@xxxxxxxxxx> > >>>>>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote: > >>>>>>>>> > >>>>>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@xxxxxxxxxx > >>>>>>>>>> On 06/10, Quentin Monnet wrote: > >>>>>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > >>>>>>>>>> > >>>>>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > >>>>>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > >>>>>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the > >>>>>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > >>>>>>>>>>> with other systems and ask libbpf to raise the limit for us if > >>>>>>>>>>> necessary. > >>>>>>>>>> > >>>>>>>>>>> How do we know if memcg-based accounting is supported? There is a probe > >>>>>>>>>>> in libbpf to check this. But this probe currently relies on the > >>>>>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > >>>>>>>>>>> landed in the same kernel version as the memory accounting change. This > >>>>>>>>>>> works in the generic case, but it may fail, for example, if the helper > >>>>>>>>>>> function has been backported to an older kernel. This has been observed > >>>>>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is > >>>>>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is > >>>>>>>>>>> not raised, and probing features with bpftool, for example, fails. > >>>>>>>>>> > >>>>>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what > >>>>>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > >>>>>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > >>>>>>>>>>> some hard-to-debug flakiness if another process starts, or the current > >>>>>>>>>>> application is killed, while the rlimit is reduced, and the approach was > >>>>>>>>>>> discarded. > >>>>>>>>>> > >>>>>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the > >>>>>>>>>>> availability of a given helper, we restore the unconditional rlimit bump > >>>>>>>>>>> in bpftool for now. > >>>>>>>>>> > >>>>>>>>>>> [0] > >>>>>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@xxxxxxxxxxxxx/ > >>>>>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > >>>>>>>>>> > >>>>>>>>>>> Cc: Yafang Shao <laoar.shao@xxxxxxxxx> > >>>>>>>>>>> Signed-off-by: Quentin Monnet <quentin@xxxxxxxxxxxxx> > >>>>>>>>>>> --- > >>>>>>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ > >>>>>>>>>>> tools/bpf/bpftool/feature.c | 2 ++ > >>>>>>>>>>> tools/bpf/bpftool/main.c | 6 +++--- > >>>>>>>>>>> tools/bpf/bpftool/main.h | 2 ++ > >>>>>>>>>>> tools/bpf/bpftool/map.c | 2 ++ > >>>>>>>>>>> tools/bpf/bpftool/pids.c | 1 + > >>>>>>>>>>> tools/bpf/bpftool/prog.c | 3 +++ > >>>>>>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ > >>>>>>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) > >>>>>>>>>> > >>>>>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > >>>>>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 > >>>>>>>>>>> --- a/tools/bpf/bpftool/common.c > >>>>>>>>>>> +++ b/tools/bpf/bpftool/common.c > >>>>>>>>>>> @@ -17,6 +17,7 @@ > >>>>>>>>>>> #include <linux/magic.h> > >>>>>>>>>>> #include <net/if.h> > >>>>>>>>>>> #include <sys/mount.h> > >>>>>>>>>>> +#include <sys/resource.h> > >>>>>>>>>>> #include <sys/stat.h> > >>>>>>>>>>> #include <sys/vfs.h> > >>>>>>>>>> > >>>>>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > >>>>>>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > >>>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>>> +void set_max_rlimit(void) > >>>>>>>>>>> +{ > >>>>>>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > >>>>>>>>>>> + > >>>>>>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); > >>>>>>>>>> > >>>>>>>>>> Do you think it might make sense to print to stderr some warning if > >>>>>>>>>> we actually happen to adjust this limit? > >>>>>>>>>> > >>>>>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > >>>>>>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > >>>>>>>>>> infinity!\n"); > >>>>>>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> ? > >>>>>>>>>> > >>>>>>>>>> Because while it's nice that we automatically do this, this might still > >>>>>>>>>> lead to surprises for some users. OTOH, not sure whether people > >>>>>>>>>> actually read those warnings? :-/ > >>>>>>>>> > >>>>>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this > >>>>>>>>> is desirable. > >>>>>>>>> > >>>>>>>>> Bpftool has raised the rlimit for a long time, it changed only in April, > >>>>>>>>> so I don't think it would come up as a surprise for people who have used > >>>>>>>>> it for a while. I think this is also something that several other > >>>>>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > >>>>>>>>> have been doing too. > >>>>>>>> > >>>>>>>> In this case ignore me and let's continue doing that :-) > >>>>>>>> > >>>>>>>> Btw, eventually we'd still like to stop doing that I'd presume? > >>>>>>> > >>>>>>> Agreed. I was thinking either finding a way to improve the probe in > >>>>>>> libbpf, or waiting for some more time until 5.11 gets old, but this may > >>>>>>> take years :/ > >>>>>>> > >>>>>>>> Should > >>>>>>>> we at some point follow up with something like: > >>>>>>>> > >>>>>>>> if (kernel_version >= 5.11) { don't touch memlock; } > >>>>>>>> > >>>>>>>> ? > >>>>>>>> > >>>>>>>> I guess we care only about <5.11 because of the backports, but 5.11+ > >>>>>>>> kernels are guaranteed to have memcg. > >>>>>>> > >>>>>>> You mean from uname() and parsing the release? Yes I suppose we could do > >>>>>>> that, can do as a follow-up. > >>>>>> > >>>>>> Yeah, uname-based, I don't think we can do better? Given that probing > >>>>>> is problematic as well :-( > >>>>>> But idk, up to you. > >>>>>> > >>>>> > >>>>> Agreed with the uname-based solution. Another possible solution is to > >>>>> probe the member 'memcg' in struct bpf_map, in case someone may > >>>>> backport memcg-based memory accounting, but that will be a little > >>>>> over-engineering. The uname-based solution is simple and can work. > >>>>> > >>>> > >>>> Thanks! Yes, memcg would be more complex: the struct is not exposed to > >>>> user space, and BTF is not a hard dependency for bpftool. I'll work on > >>>> the uname-based test as a follow-up to this set. > >>>> > >>> > >>> After a second thought, the uname-based test may not work, because > >>> CONFIG_MEMCG_KMEM can be disabled. > >> > >> Does it matter? Regardless of whether there is memcg or not, we > >> shouldn't touch ulimit on 5.11+ > >> If there is no memcg, there is no bpf memory enforcement. > > > > Right, rlimit-based accounting is totally removed, that is not the > > same with what I thought before, while I thought it will fallback to > > rlimit-based if kmemcg is disabled. > > Agreed, and so I've got a patch ready for the uname-based probe. > > But talking about this with Daniel, we were wondering if it would make > sense instead to have the probe I had initially submitted (lower the > rlimit to 0, attempt to load a program, reset rlimit - see [0]), but > only for bpftool instead of libbpf? My understanding is that the memlock > rlimit is per-process, right? So this shouldn't affect any other > process, and because bpftool is not multithreaded, nothing other than > probing would happen while the rlimit is at zero? Makes sense. It is safe to do the probe within bpftool. > Or is it just simpler, > if less accurate, to stick to the uname probe? > > Quentin > > [0] > https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@xxxxxxxxxxxxx/ -- Regards Yafang