Re: [PATCH bpf-next] bpftool: Probe for memcg-based accounting before bumping rlimit

Quentin Monnet <quentin@xxxxxxxxxxxxx> · Wed, 29 Jun 2022 12:11:20 +0100

On 28/06/2022 18:53, Stanislav Fomichev wrote:
> On Tue, Jun 28, 2022 at 9:45 AM Quentin Monnet <quentin@xxxxxxxxxxxxx> wrote:
>>
>> Bpftool used to bump the memlock rlimit to make sure to be able to load
>> BPF objects. After the kernel has switched to memcg-based memory
>> accounting [0] in 5.11, bpftool has relied on libbpf to probe the system
>> for memcg-based accounting support and for raising the rlimit if
>> necessary [1]. But this was later reverted, because the probe would
>> sometimes fail, resulting in bpftool not being able to load all required
>> objects [2].
>>
>> Here we add a more efficient probe, in bpftool itself. We first lower
>> the rlimit to 0, then we attempt to load a BPF object (and finally reset
>> the rlimit): if the load succeeds, then memcg-based memory accounting is
>> supported.
>>
>> This approach was earlier proposed for the probe in libbpf itself [3],
>> but given that the library may be used in multithreaded applications,
>> the probe could have undesirable consequences if one thread attempts to
>> lock kernel memory while memlock rlimit is at 0. Since bpftool is
>> single-threaded and the rlimit is process-based, this is fine to do in
>> bpftool itself.
>>
>> This probe was inspired by the similar one from the cilium/ebpf Go
>> library [4].
>>
>> [0] commit 97306be45fbe ("Merge branch 'switch to memcg-based memory accounting'")
>> [1] commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK")
>> [2] commit 6b4384ff1088 ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"")
>> [3] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@xxxxxxxxxxxxx/t/#u
>> [4] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
>>
>> Cc: Stanislav Fomichev <sdf@xxxxxxxxxx>
>> Cc: Yafang Shao <laoar.shao@xxxxxxxxx>
>> Suggested-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
>> Signed-off-by: Quentin Monnet <quentin@xxxxxxxxxxxxx>
>> ---
>>  tools/bpf/bpftool/common.c   | 71 ++++++++++++++++++++++++++++++++++--
>>  tools/include/linux/kernel.h |  5 +++
>>  2 files changed, 73 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
>> index a0d4acd7c54a..e07769802f76 100644
>> --- a/tools/bpf/bpftool/common.c
>> +++ b/tools/bpf/bpftool/common.c
>> @@ -13,14 +13,17 @@
>>  #include <stdlib.h>
>>  #include <string.h>
>>  #include <unistd.h>
>> -#include <linux/limits.h>
>> -#include <linux/magic.h>
>>  #include <net/if.h>
>>  #include <sys/mount.h>
>>  #include <sys/resource.h>
>>  #include <sys/stat.h>
>>  #include <sys/vfs.h>
>>
>> +#include <linux/filter.h>
>> +#include <linux/limits.h>
>> +#include <linux/magic.h>
>> +#include <linux/unistd.h>
>> +
>>  #include <bpf/bpf.h>
>>  #include <bpf/hashmap.h>
>>  #include <bpf/libbpf.h> /* libbpf_num_possible_cpus */
>> @@ -73,11 +76,73 @@ static bool is_bpffs(char *path)
>>         return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
>>  }
>>
>> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
>> + * memcg-based memory accounting for BPF maps and programs. This was done in
>> + * commit 97306be45fbe ("Merge branch 'switch to memcg-based memory
>> + * accounting'"), in Linux 5.11.
>> + *
>> + * Libbpf also offers to probe for memcg-based accounting vs rlimit, but does
>> + * so by checking for the availability of a given BPF helper and this has
>> + * failed on some kernels with backports in the past, see commit 6b4384ff1088
>> + * ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"").
>> + * Instead, we can probe by lowering the process-based rlimit to 0, trying to
>> + * load a BPF object, and resetting the rlimit. If the load succeeds then
>> + * memcg-based accounting is supported.
>> + *
>> + * This would be too dangerous to do in the library, because multithreaded
>> + * applications might attempt to load items while the rlimit is at 0. Given
>> + * that bpftool is single-threaded, this is fine to do here.
>> + */
>> +static bool known_to_need_rlimit(void)
>> +{
>> +       const size_t prog_load_attr_sz = offsetofend(union bpf_attr, attach_btf_obj_fd);
> 
> nit:
> Any specific reason you're hard coding this sz via offseofend? Why not
> use sizeof(bpf_attr) directly as a syscall/memset size?
> The kernel should handle all these cases where bpftool has extra zero
> padding, right?

No particular reason. Good point, I'll send a v2 to address this.

Thanks,
Quentin