On 28/06/2022 18:53, Stanislav Fomichev wrote: > On Tue, Jun 28, 2022 at 9:45 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote: >> >> Bpftool used to bump the memlock rlimit to make sure to be able to load >> BPF objects. After the kernel has switched to memcg-based memory >> accounting [0] in 5.11, bpftool has relied on libbpf to probe the system >> for memcg-based accounting support and for raising the rlimit if >> necessary [1]. But this was later reverted, because the probe would >> sometimes fail, resulting in bpftool not being able to load all required >> objects [2]. >> >> Here we add a more efficient probe, in bpftool itself. We first lower >> the rlimit to 0, then we attempt to load a BPF object (and finally reset >> the rlimit): if the load succeeds, then memcg-based memory accounting is >> supported. >> >> This approach was earlier proposed for the probe in libbpf itself [3], >> but given that the library may be used in multithreaded applications, >> the probe could have undesirable consequences if one thread attempts to >> lock kernel memory while memlock rlimit is at 0. Since bpftool is >> single-threaded and the rlimit is process-based, this is fine to do in >> bpftool itself. >> >> This probe was inspired by the similar one from the cilium/ebpf Go >> library [4]. >> >> [0] commit 97306be45fbe ("Merge branch 'switch to memcg-based memory accounting'") >> [1] commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK") >> [2] commit 6b4384ff1088 ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"") >> [3] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@xxxxxxxxxxxxx/t/#u >> [4] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 >> >> Cc: Stanislav Fomichev <sdf@xxxxxxxxxx> >> Cc: Yafang Shao <laoar.shao@xxxxxxxxx> >> Suggested-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx> >> Signed-off-by: Quentin Monnet <quentin@xxxxxxxxxxxxx> >> --- >> tools/bpf/bpftool/common.c | 71 ++++++++++++++++++++++++++++++++++-- >> tools/include/linux/kernel.h | 5 +++ >> 2 files changed, 73 insertions(+), 3 deletions(-) >> >> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c >> index a0d4acd7c54a..e07769802f76 100644 >> --- a/tools/bpf/bpftool/common.c >> +++ b/tools/bpf/bpftool/common.c >> @@ -13,14 +13,17 @@ >> #include <stdlib.h> >> #include <string.h> >> #include <unistd.h> >> -#include <linux/limits.h> >> -#include <linux/magic.h> >> #include <net/if.h> >> #include <sys/mount.h> >> #include <sys/resource.h> >> #include <sys/stat.h> >> #include <sys/vfs.h> >> >> +#include <linux/filter.h> >> +#include <linux/limits.h> >> +#include <linux/magic.h> >> +#include <linux/unistd.h> >> + >> #include <bpf/bpf.h> >> #include <bpf/hashmap.h> >> #include <bpf/libbpf.h> /* libbpf_num_possible_cpus */ >> @@ -73,11 +76,73 @@ static bool is_bpffs(char *path) >> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; >> } >> >> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to >> + * memcg-based memory accounting for BPF maps and programs. This was done in >> + * commit 97306be45fbe ("Merge branch 'switch to memcg-based memory >> + * accounting'"), in Linux 5.11. >> + * >> + * Libbpf also offers to probe for memcg-based accounting vs rlimit, but does >> + * so by checking for the availability of a given BPF helper and this has >> + * failed on some kernels with backports in the past, see commit 6b4384ff1088 >> + * ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK""). >> + * Instead, we can probe by lowering the process-based rlimit to 0, trying to >> + * load a BPF object, and resetting the rlimit. If the load succeeds then >> + * memcg-based accounting is supported. >> + * >> + * This would be too dangerous to do in the library, because multithreaded >> + * applications might attempt to load items while the rlimit is at 0. Given >> + * that bpftool is single-threaded, this is fine to do here. >> + */ >> +static bool known_to_need_rlimit(void) >> +{ >> + const size_t prog_load_attr_sz = offsetofend(union bpf_attr, attach_btf_obj_fd); > > nit: > Any specific reason you're hard coding this sz via offseofend? Why not > use sizeof(bpf_attr) directly as a syscall/memset size? > The kernel should handle all these cases where bpftool has extra zero > padding, right? No particular reason. Good point, I'll send a v2 to address this. Thanks, Quentin