Re: [RFC PATCH bpf-next v2 00/11] mm, bpf: Add BPF into /proc/meminfo

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Thu, 12 Jan 2023 13:05:17 -0800

On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> Currently there's no way to get BPF memory usage; we can only
> estimate it with bpftool or memcg, neither of which is reliable.
>
> - bpftool
>   `bpftool {map,prog} show` can show us the memlock of each map and
>   prog, but the memlock varies from the real memory size. The memlock
>   of a bpf object is approximately
>   `round_up(key_size + value_size, 8) * max_entries`,
>   so 1) it can't apply to the non-preallocated bpf map which may
>   increase or decrease the real memory size dynamically. 2) the element
>   size of some bpf map is not `key_size + value_size`, for example the
>   element size of htab is
>   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
>   That said, the difference between these two values may be very large if
>   the key_size and value_size are small. For example, in my verification,
>   the memlock and the real memory size of a preallocated hash map are:
>
>   $ grep BPF /proc/meminfo
>   BPF:                 350 kB  <<< the size of preallocated memalloc pool
>
>   (create hash map)
>
>   $ bpftool map show
>   41549: hash  name count_map  flags 0x0
>         key 4B  value 4B  max_entries 1048576  memlock 8388608B
>
>   $ grep BPF /proc/meminfo
>   BPF:               82284 kB
>
>   So the real memory size is $((82284 - 350)) which is 81934 kB
>   while the memlock is only 8192 kB.

hashmap with key 4b and value 4b looks artificial to me,
but since you're concerned with accuracy of bpftool reporting,
please fix the estimation in bpf_map_memory_footprint().
You're correct that:

> size of some bpf map is not `key_size + value_size`, for example the
>   element size of htab is
>   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`

So just teach bpf_map_memory_footprint() to do this more accurately.
Add bucket size to it as well.
Make it even more accurate with prealloc vs not.
Much simpler change than adding run-time overhead to every alloc/free
on bpf side.
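
To make the suggestion concrete, here is a rough userspace sketch of
the arithmetic, not the actual bpf_map_memory_footprint() code. The
struct, the helper names, and the idea of passing the element header
and bucket sizes in as parameters are all assumptions made for
illustration; in the kernel they would come from
sizeof(struct htab_elem) and the bucket struct. It also assumes the
bucket count is max_entries rounded up to a power of two, and that a
non-preallocated map only pays for the elements currently in it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t round_up_u64(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Smallest power of two that is >= n. */
static uint64_t roundup_pow2_u64(uint64_t n)
{
	uint64_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Hypothetical parameter block, not a kernel structure. */
struct htab_params {
	uint32_t key_size;
	uint32_t value_size;
	uint64_t max_entries;
	uint64_t cur_entries;	/* only used when !prealloc */
	uint64_t elem_hdr_size;	/* stand-in for sizeof(struct htab_elem) */
	uint64_t bucket_size;	/* stand-in for the per-bucket struct size */
	bool prealloc;
};

/* The estimate quoted above: round_up(key + value, 8) * max_entries. */
static uint64_t naive_memlock(uint32_t key_size, uint32_t value_size,
			      uint64_t max_entries)
{
	return round_up_u64((uint64_t)key_size + value_size, 8) * max_entries;
}

/* Per-element size including the header, plus the bucket array. */
static uint64_t htab_footprint_estimate(const struct htab_params *p)
{
	uint64_t elem_size = p->elem_hdr_size +
			     round_up_u64(p->key_size, 8) +
			     round_up_u64(p->value_size, 8);
	uint64_t n_elems = p->prealloc ? p->max_entries : p->cur_entries;
	uint64_t n_buckets = roundup_pow2_u64(p->max_entries);

	return elem_size * n_elems + n_buckets * p->bucket_size;
}
```

With key 4B, value 4B, max_entries 1048576, and purely illustrative
sizes of 48 bytes for the element header and 16 bytes per bucket,
naive_memlock() gives 8388608 bytes while htab_footprint_estimate()
gives 83886080 bytes (81920 kB) for the preallocated case.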

Higher level point:
bpf side tracks all of its allocations. There is no need to do that
in generic mm side.
Exposing an aggregated single number in /proc/meminfo also looks wrong.
People should be able to "bpftool map show|awk sum of fields"
and get the same number.
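
As a sketch of what "sum of fields" means here, the helper below adds
up every `memlock <N>B` field in a text buffer shaped like the
`bpftool map show` output quoted above. The function name is made up
for illustration, and the parsing assumes the plain-text format shown
in this thread, nothing more.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sum every "memlock <N>B" field found in text; returns bytes. */
static uint64_t sum_memlock(const char *text)
{
	static const char tag[] = "memlock ";
	uint64_t total = 0;
	const char *p = text;

	while ((p = strstr(p, tag)) != NULL) {
		char *end;
		unsigned long long v;

		p += sizeof(tag) - 1;		/* skip past "memlock " */
		v = strtoull(p, &end, 10);
		if (end != p && *end == 'B')	/* digits followed by 'B' */
			total += v;
		p = end;
	}
	return total;
}
```

Feeding it the single map shown above returns 8388608.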