On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
>
> 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@xxxxxxxxx>
> > On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >>
> >> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
> >>>
> >>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@xxxxxxxxxx>
> >>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@xxxxxxxxxx
> >>>>>> On 06/10, Quentin Monnet wrote:
> >>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.
> >>>>>>
> >>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
> >>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
> >>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the
> >>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility
> >>>>>>> with other systems and ask libbpf to raise the limit for us if
> >>>>>>> necessary.
> >>>>>>
> >>>>>>> How do we know if memcg-based accounting is supported? There is a probe
> >>>>>>> in libbpf to check this. But this probe currently relies on the
> >>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
> >>>>>>> landed in the same kernel version as the memory accounting change. This
> >>>>>>> works in the generic case, but it may fail, for example, if the helper
> >>>>>>> function has been backported to an older kernel. This has been observed
> >>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is
> >>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is
> >>>>>>> not raised, and probing features with bpftool, for example, fails.
> >>>>>>
> >>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what
> >>>>>>> the cilium/ebpf Go library does [1].
> >>>>>>> It would lower the soft rlimit to
> >>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce
> >>>>>>> some hard-to-debug flakiness if another process starts, or the current
> >>>>>>> application is killed, while the rlimit is reduced, and the approach was
> >>>>>>> discarded.
> >>>>>>
> >>>>>>> As a workaround to ensure that the rlimit bump does not depend on the
> >>>>>>> availability of a given helper, we restore the unconditional rlimit bump
> >>>>>>> in bpftool for now.
> >>>>>>
> >>>>>>> [0]
> >>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@xxxxxxxxxxxxx/
> >>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
> >>>>>>
> >>>>>>> Cc: Yafang Shao <laoar.shao@xxxxxxxxx>
> >>>>>>> Signed-off-by: Quentin Monnet <quentin@xxxxxxxxxxxxx>
> >>>>>>> ---
> >>>>>>>  tools/bpf/bpftool/common.c     | 8 ++++++++
> >>>>>>>  tools/bpf/bpftool/feature.c    | 2 ++
> >>>>>>>  tools/bpf/bpftool/main.c       | 6 +++---
> >>>>>>>  tools/bpf/bpftool/main.h       | 2 ++
> >>>>>>>  tools/bpf/bpftool/map.c        | 2 ++
> >>>>>>>  tools/bpf/bpftool/pids.c       | 1 +
> >>>>>>>  tools/bpf/bpftool/prog.c       | 3 +++
> >>>>>>>  tools/bpf/bpftool/struct_ops.c | 2 ++
> >>>>>>>  8 files changed, 23 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> >>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644
> >>>>>>> --- a/tools/bpf/bpftool/common.c
> >>>>>>> +++ b/tools/bpf/bpftool/common.c
> >>>>>>> @@ -17,6 +17,7 @@
> >>>>>>>  #include <linux/magic.h>
> >>>>>>>  #include <net/if.h>
> >>>>>>>  #include <sys/mount.h>
> >>>>>>> +#include <sys/resource.h>
> >>>>>>>  #include <sys/stat.h>
> >>>>>>>  #include <sys/vfs.h>
> >>>>>>
> >>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
> >>>>>>>  	return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
> >>>>>>>  }
> >>>>>>
> >>>>>>> +void set_max_rlimit(void)
> >>>>>>> +{
> >>>>>>> +	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> >>>>>>> +
> >>>>>>> +	setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>>>>>
> >>>>>> Do you think it might make sense to print to stderr some warning if
> >>>>>> we actually happen to adjust this limit?
> >>>>>>
> >>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
> >>>>>>     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
> >>>>>> infinity!\n");
> >>>>>>     setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>>>>> }
> >>>>>>
> >>>>>> ?
> >>>>>>
> >>>>>> Because while it's nice that we automatically do this, this might still
> >>>>>> lead to surprises for some users. OTOH, not sure whether people
> >>>>>> actually read those warnings? :-/
> >>>>>
> >>>>> I'm not strictly opposed to a warning, but I'm not completely sure this
> >>>>> is desirable.
> >>>>>
> >>>>> Bpftool has raised the rlimit for a long time, it changed only in April,
> >>>>> so I don't think it would come up as a surprise for people who have used
> >>>>> it for a while. I think this is also something that several other
> >>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
> >>>>> have been doing too.
> >>>>
> >>>> In this case ignore me and let's continue doing that :-)
> >>>>
> >>>> Btw, eventually we'd still like to stop doing that I'd presume?
> >>>
> >>> Agreed. I was thinking either finding a way to improve the probe in
> >>> libbpf, or waiting for some more time until 5.11 gets old, but this may
> >>> take years :/
> >>>
> >>>> Should
> >>>> we at some point follow up with something like:
> >>>>
> >>>> if (kernel_version >= 5.11) { don't touch memlock; }
> >>>>
> >>>> ?
> >>>>
> >>>> I guess we care only about <5.11 because of the backports, but 5.11+
> >>>> kernels are guaranteed to have memcg.
> >>>
> >>> You mean from uname() and parsing the release? Yes I suppose we could do
> >>> that, can do as a follow-up.
> >>
> >> Yeah, uname-based, I don't think we can do better? Given that probing
> >> is problematic as well :-(
> >> But idk, up to you.
> >>
> >
> > Agreed with the uname-based solution.
> > Another possible solution is to
> > probe the member 'memcg' in struct bpf_map, in case someone may
> > backport memcg-based memory accounting, but that will be a little
> > over-engineering. The uname-based solution is simple and can work.
> >
>
> Thanks! Yes, memcg would be more complex: the struct is not exposed to
> user space, and BTF is not a hard dependency for bpftool. I'll work on
> the uname-based test as a follow-up to this set.
>

After a second thought, the uname-based test may not work, because
CONFIG_MEMCG_KMEM can be disabled.

Maybe we can probe the member 'memcg' in struct bpf_map by parsing
/sys/kernel/btf/vmlinux:

[8584] STRUCT 'bpf_map' size=256 vlen=27
        'ops' type_id=8659 bits_offset=0
        'inner_map_meta' type_id=8587 bits_offset=64
        'security' type_id=93 bits_offset=128
        'map_type' type_id=8532 bits_offset=192
        'key_size' type_id=36 bits_offset=224
        'value_size' type_id=36 bits_offset=256
        'max_entries' type_id=36 bits_offset=288
        'map_extra' type_id=38 bits_offset=320
        'map_flags' type_id=36 bits_offset=384
        'spin_lock_off' type_id=21 bits_offset=416
        'timer_off' type_id=21 bits_offset=448
        'id' type_id=36 bits_offset=480
        'numa_node' type_id=21 bits_offset=512
        'btf_key_type_id' type_id=36 bits_offset=544
        'btf_value_type_id' type_id=36 bits_offset=576
        'btf_vmlinux_value_type_id' type_id=36 bits_offset=608
        'btf' type_id=8660 bits_offset=640
        'memcg' type_id=687 bits_offset=704    <<<< here
        'name' type_id=337 bits_offset=768
        'bypass_spec_v1' type_id=63 bits_offset=896
        'frozen' type_id=63 bits_offset=904
        'refcnt' type_id=81 bits_offset=1024
        'usercnt' type_id=81 bits_offset=1088
        'work' type_id=484 bits_offset=1152
        'freeze_mutex' type_id=443 bits_offset=1408
        'writecnt' type_id=81 bits_offset=1664
        'owner' type_id=8658 bits_offset=1728

If 'memcg' exists, it is memcg-based, otherwise it is rlimit-based.

WDYT?

-- 
Regards
Yafang
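
For illustration only, here is a minimal sketch of the 'memcg' check
described above. It assumes the text dump has already been read into a
string (for example from "bpftool btf dump file /sys/kernel/btf/vmlinux")
and simply scans it. The helper name btf_dump_struct_has_member() is
hypothetical and is not part of bpftool or libbpf; a real probe would more
likely walk the BTF data through libbpf than parse text.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a bpftool or libbpf function): scan a textual
 * BTF dump and report whether struct 'struct_name' has a member 'member'.
 * Assumes each type starts on a line beginning with '[' and that member
 * lines look like: 'memcg' type_id=687 bits_offset=704
 */
static bool btf_dump_struct_has_member(const char *dump,
				       const char *struct_name,
				       const char *member)
{
	char header[128], needle[128], buf[512];
	const char *line = dump;
	bool in_struct = false;

	snprintf(header, sizeof(header), "STRUCT '%s' ", struct_name);
	snprintf(needle, sizeof(needle), "'%s' type_id=", member);

	while (*line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		/* Work on a bounded, NUL-terminated copy of the line. */
		if (len >= sizeof(buf))
			len = sizeof(buf) - 1;
		memcpy(buf, line, len);
		buf[len] = '\0';

		if (buf[0] == '[') {
			/* Next type reached: the struct had no such member. */
			if (in_struct)
				return false;
			in_struct = strstr(buf, header) != NULL;
		} else if (in_struct && strstr(buf, needle)) {
			return true;
		}

		if (!eol)
			break;
		line = eol + 1;
	}
	return false;
}
```

With the dump above, btf_dump_struct_has_member(dump, "bpf_map", "memcg")
would return true, meaning memcg-based accounting; on a kernel without the
member it would return false, meaning the rlimit still needs to be raised.
Note that this only works where BTF is available, which is not guaranteed.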