Re: [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF

Daniel Borkmann <daniel@xxxxxxxxxxxxx> · Tue, 14 Dec 2021 16:09:02 +0100

On 12/14/21 1:48 AM, Andrii Nakryiko wrote:
The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
one of the first extremely frustrating gotchas that all new BPF users
run into, and in some cases they have to learn it the hard way.

Luckily, starting with upstream Linux kernel version 5.11, the BPF
subsystem dropped the dependency on memlock and uses memcg-based memory
accounting instead. Unfortunately, detecting memcg-based BPF memory
accounting is far from trivial (as evidenced by this patch), so in
practice most BPF applications still increase RLIMIT_MEMLOCK
unconditionally.
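
For illustration, the unconditional increase mentioned above usually boils down to a single setrlimit() call made before any BPF object is loaded. The following is only a minimal sketch: the helper name bump_memlock_rlimit() is made up for this example and is not a libbpf API.

```c
#include <errno.h>
#include <sys/resource.h>

/* Hypothetical helper (not part of libbpf): set both the soft and the
 * hard RLIMIT_MEMLOCK to `bytes`. Returns 0 on success or a negative
 * errno value on failure, e.g. -EPERM when an unprivileged process
 * tries to raise the hard limit.
 */
static int bump_memlock_rlimit(rlim_t bytes)
{
	struct rlimit rlim = {
		.rlim_cur = bytes,
		.rlim_max = bytes,
	};

	if (setrlimit(RLIMIT_MEMLOCK, &rlim))
		return -errno;
	return 0;
}
```

An application following this pattern would typically call bump_memlock_rlimit(RLIM_INFINITY) once at startup and treat a failure as fatal only if a later BPF load fails as well.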

As we move towards libbpf 1.0, it would be good to allow users to forget
about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment
automatically. This patch paves the way forward in this matter. Libbpf
will do feature detection of memcg-based accounting, and if detected,
will do nothing. But if the kernel is too old, libbpf will, just like
BCC, automatically increase RLIMIT_MEMLOCK on behalf of the user
application ([0]).

As this is technically a breaking change, during the transition period
applications have to opt into libbpf 1.0 mode by setting
LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
libbpf_set_strict_mode().

Libbpf lets users control the exact RLIMIT_MEMLOCK value it sets via
the libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf do
nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to be
called before the first bpf_prog_load(), bpf_btf_load(), or
bpf_object__load() call; otherwise it has no effect and will return
-EBUSY.
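
The ordering rule described above (the limit can only be configured before the first load, afterwards the setter returns -EBUSY) can be modelled in a few lines. This is a toy model under stated assumptions, not libbpf code: every name prefixed with sketch_ is hypothetical, and the SIZE_MAX default is a placeholder rather than libbpf's actual default.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: has a load-like call already consumed the limit? */
static bool sketch_memlock_handled;
/* Hypothetical default, meaning "no upper bound"; 0 means "do nothing". */
static size_t sketch_memlock_rlim_max = SIZE_MAX;

/* Models the setter: accepted only before the first load-like call. */
static int sketch_set_memlock_rlim_max(size_t bytes)
{
	if (sketch_memlock_handled)
		return -EBUSY;
	sketch_memlock_rlim_max = bytes;
	return 0;
}

/* Models the first load-like call: latches the configured limit and
 * returns it, so the caller can see which value would be applied
 * (0 means RLIMIT_MEMLOCK is left untouched).
 */
static size_t sketch_first_load(void)
{
	sketch_memlock_handled = true;
	return sketch_memlock_rlim_max;
}
```

In this model, calling sketch_set_memlock_rlim_max(0) and then sketch_first_load() yields 0 (leave the limit alone), and any later call to the setter is rejected with -EBUSY.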

   [0] Closes: https://github.com/libbpf/libbpf/issues/369

Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
[...]

+/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
+ * memcg-based memory accounting for BPF maps and progs. This was done in [0].
+ * We use the difference in reporting memlock value in BPF map's fdinfo before
+ * and after [0] to detect whether memcg accounting is done for BPF subsystem
+ * or not.
+ *
+ * Before the change, memlock value for ARRAY map would be calculated as:
+ *
+ *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries;
+ *   memlock = round_up(memlock, PAGE_SIZE);
+ *
+ *
+ * After, memlock is approximated as:
+ *
+ *   memlock = round_up(key_size + value_size, 8) * max_entries;
+ *   memlock = round_up(memlock, PAGE_SIZE);
+ *
+ * In this check we use the fact that sizeof(struct bpf_array) is about 300
+ * bytes, so if we use value_size = (PAGE_SIZE - 100), before memcg
+ * approximation memlock would be rounded up to 2 * PAGE_SIZE, while with
+ * memcg approximation it will stay at single PAGE_SIZE (key_size is 4 for
+ * array and doesn't make much difference given 100 byte decrement we use for
+ * value_size).
+ *
+ *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@xxxxxx/
+ */
+int probe_memcg_account(void)
+{
+	const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra);
+	long page_sz = sysconf(_SC_PAGESIZE), memlock_sz;
+	char buf[128];
+	union bpf_attr attr;
+	int map_fd;
+	FILE *f;
+
+	memset(&attr, 0, map_create_attr_sz);
+	attr.map_type = BPF_MAP_TYPE_ARRAY;
+	attr.key_size = 4;
+	attr.value_size = page_sz - 100;
+	attr.max_entries = 1;
+	map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz);
+	if (map_fd < 0)
+		return -errno;
+
+	sprintf(buf, "/proc/self/fdinfo/%d", map_fd);
+	f = fopen(buf, "r");
+	while (f && !feof(f) && fgets(buf, sizeof(buf), f)) {
+		if (sscanf(buf, "memlock: %ld\n", &memlock_sz) == 1) {
+			fclose(f);
+			close(map_fd);
+			return memlock_sz == page_sz ? 1 : 0;
+		}
+	}
+
+	/* proc FS is disabled or we failed to parse fdinfo properly, assume
+	 * we need setrlimit
+	 */
+	if (f)
+		fclose(f);
+	close(map_fd);
+	return 0;
+}
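
To make the arithmetic in the comment above concrete, here is a small self-contained sketch of the two memlock formulas. The helper names are invented for this example, and the 300-byte size of struct bpf_array is the approximation quoted in the comment, not a measured value.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of align. */
static size_t round_up_to(size_t x, size_t align)
{
	assert(align != 0);
	return (x + align - 1) / align * align;
}

/* memlock as reported in fdinfo before memcg-based accounting;
 * array_hdr_sz stands in for sizeof(struct bpf_array).
 */
static size_t memlock_before(size_t array_hdr_sz, size_t value_size,
			     size_t max_entries, size_t page_sz)
{
	size_t memlock = array_hdr_sz + round_up_to(value_size, 8) * max_entries;

	return round_up_to(memlock, page_sz);
}

/* memlock as approximated in fdinfo after the switch to memcg. */
static size_t memlock_after(size_t key_size, size_t value_size,
			    size_t max_entries, size_t page_sz)
{
	size_t memlock = round_up_to(key_size + value_size, 8) * max_entries;

	return round_up_to(memlock, page_sz);
}
```

With a 4096-byte page, value_size = 3996 and one entry, memlock_before(300, 3996, 1, 4096) is 8192 (300 + 4000 = 4300, rounded up to two pages), while memlock_after(4, 3996, 1, 4096) is 4096 (4000 rounded up to one page). That one-page versus two-page difference is what the probe keys on.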

Another option, which might be slightly more robust, would be to probe
for a BPF helper that was added in the 5.11 kernel. As Toke noted earlier,
it might not work with out-of-order backports, but if it's good with RHEL
in this specific case, we should be covered for 99% of cases. Potentially,
we could then still fall back to the above probing logic?

Thanks,
Daniel