On Tue, Dec 14, 2021 at 7:09 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On 12/14/21 1:48 AM, Andrii Nakryiko wrote:
> > The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
> > one of the first extremely frustrating gotchas that all new BPF users
> > go through and in some cases have to learn the very hard way.
> >
> > Luckily, starting with upstream Linux kernel version 5.11, the BPF
> > subsystem dropped the dependency on memlock and uses memcg-based
> > memory accounting instead. Unfortunately, detecting memcg-based BPF
> > memory accounting is far from trivial (as can be evidenced by this
> > patch), so in practice most BPF applications still do an unconditional
> > RLIMIT_MEMLOCK increase.
> >
> > As we move towards libbpf 1.0, it would be good to allow users to
> > forget about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible
> > adjustment automatically. This patch paves the way forward in this
> > matter. Libbpf will do feature detection of memcg-based accounting,
> > and if detected, will do nothing. But if the kernel is too old, just
> > like BCC, libbpf will automatically increase RLIMIT_MEMLOCK on behalf
> > of the user application ([0]).
> >
> > As this is technically a breaking change, during the transition period
> > applications have to opt into libbpf 1.0 mode by setting the
> > LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
> > libbpf_set_strict_mode().
> >
> > Libbpf allows controlling the exact RLIMIT_MEMLOCK limit that is set
> > via the libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf
> > do nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to
> > be called before the first bpf_prog_load(), bpf_btf_load(), or
> > bpf_object__load() call, otherwise it has no effect and will return
> > -EBUSY.
> >
> > [0] Closes: https://github.com/libbpf/libbpf/issues/369
> >
> > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> [...]
> > +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
> > + * memcg-based memory accounting for BPF maps and progs. This was done in [0].
> > + * We use the difference in reporting memlock value in BPF map's fdinfo before
> > + * and after [0] to detect whether memcg accounting is done for BPF subsystem
> > + * or not.
> > + *
> > + * Before the change, memlock value for ARRAY map would be calculated as:
> > + *
> > + *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries;
> > + *   memlock = round_up(memlock, PAGE_SIZE);
> > + *
> > + * After, memlock is approximated as:
> > + *
> > + *   memlock = round_up(key_size + value_size, 8) * max_entries;
> > + *   memlock = round_up(memlock, PAGE_SIZE);
> > + *
> > + * In this check we use the fact that sizeof(struct bpf_array) is about 300
> > + * bytes, so if we use value_size = (PAGE_SIZE - 100), before memcg
> > + * approximation memlock would be rounded up to 2 * PAGE_SIZE, while with
> > + * memcg approximation it will stay at a single PAGE_SIZE (key_size is 4 for
> > + * array and doesn't make much difference given the 100 byte decrement we
> > + * use for value_size).
> > + *
> > + * [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@xxxxxx/
> > + */
> > +int probe_memcg_account(void)
> > +{
> > +	const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra);
> > +	long page_sz = sysconf(_SC_PAGESIZE), memlock_sz;
> > +	char buf[128];
> > +	union bpf_attr attr;
> > +	int map_fd;
> > +	FILE *f;
> > +
> > +	memset(&attr, 0, map_create_attr_sz);
> > +	attr.map_type = BPF_MAP_TYPE_ARRAY;
> > +	attr.key_size = 4;
> > +	attr.value_size = page_sz - 100;
> > +	attr.max_entries = 1;
> > +	map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz);
> > +	if (map_fd < 0)
> > +		return -errno;
> > +
> > +	sprintf(buf, "/proc/self/fdinfo/%d", map_fd);
> > +	f = fopen(buf, "r");
> > +	while (f && !feof(f) && fgets(buf, sizeof(buf), f)) {
> > +		if (fscanf(f, "memlock: %ld\n", &memlock_sz) == 1) {
> > +			fclose(f);
> > +			close(map_fd);
> > +			return memlock_sz == page_sz ? 1 : 0;
> > +		}
> > +	}
> > +
> > +	/* proc FS is disabled or we failed to parse fdinfo properly, assume
> > +	 * we need setrlimit
> > +	 */
> > +	if (f)
> > +		fclose(f);
> > +	close(map_fd);
> > +	return 0;
> > +}

> One other option which might be slightly more robust perhaps could be to probe
> for a BPF helper that was added along with the 5.11 kernel. As Toke noted
> earlier it might not work with out-of-order backports, but if it's good with
> RHEL in this specific case, we should be covered for 99% of cases.
> Potentially, we could then still try to fall back to the above probing logic?

Ok, I was originally thinking of probing bpf_sock_from_file() (which was added
after the memcg change), but it's a PITA. But I see that slightly before that
(in the same 5.11 release) the bpf_ktime_get_coarse_ns() helper was added,
which is the simplest helper to test for. Let me test that instead; it should
be very reliable (apart from out-of-order backports, but I can personally live
with that, given that we fall back to a safe setrlimit() default).
But I'm not going to do the fallback; the helper probe doesn't add any extra
dependencies and should be very reliable.

>
> Thanks,
> Daniel