Re: [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Fri, 22 Nov 2024 19:29:50 -0800

On Wed, Nov 20, 2024 at 5:20 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi Alexei,
>
> On 11/20/2024 9:16 AM, Alexei Starovoitov wrote:
> > On Sun, Nov 17, 2024 at 4:56 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >>
> >> +enum {
> >> +       LPM_TRIE_MA_IM = 0,
> >> +       LPM_TRIE_MA_LEAF,
> >> +       LPM_TRIE_MA_CNT,
> >> +};
> >> +
> >>  struct lpm_trie {
> >>         struct bpf_map                  map;
> >>         struct lpm_trie_node __rcu      *root;
> >> +       struct bpf_mem_alloc            ma[LPM_TRIE_MA_CNT];
> >> +       struct bpf_mem_alloc            *im_ma;
> >> +       struct bpf_mem_alloc            *leaf_ma;
> > We cannot use bpf_ma-s liberally like that.
> > Freelists are not huge, but we shouldn't be adding new bpf_ma
> > in every map and every use case.
> >
> > bpf_mem_cache_is_mergeable() in the previous patch also
> > leaks implementation details.
> >
> > Can you use bpf_global_ma for all nodes?
>
> Will try. However, there are mainly two differences between
> bpf_global_ma and map specific bpf_mem_alloc. The first one is the
> memory accounting problem. All memories allocated from bpf_global_ma
> will be accounted to the root memory cgroup instead of the current
> memory cgroup (due to the return value of get_memcg()). I think we could
> fix this partially by returning NULL instead of root_mem_cgroup when
> c->objcg is NULL. However, even with the fix, the memory account is
> still inaccurate, because these pre-allocated objects may be used by
> other maps instead of the map which triggers the pre-allocation.

That's a valid point.
Though we ignore this issue in bpf_obj_new and other places,
if we can account into memcg correctly we should do it.

> The
> second one is the freeing of freed objects when destroying the map. For
> a map specific bpf_mem_alloc, most of these freed objects could be freed
> immediately back to slub. However, it is not true for the bpf_global_ma,
> because we could not tell whether the object belongs to a to-be-freed
> map or not. And also we can not drain the bpf_global_ma just like we do
> for bpf_mem_alloc.

I don't think it's a big issue here. Optimizing delays in the free path
is imo premature. The extra complexity is not worth it.

Let's do one bpf_ma for the lpm trie, sized for leaf nodes (LPM_TRIE_MA_LEAF).
Inner nodes may waste some memory and that's ok.
The whole LPM trie is not efficient anyway.
Micro-optimizing at the bpf_ma level is a small improvement compared
to rewriting the whole LPM map with a more performant and
memory-efficient algorithm.
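
To make the trade-off concrete, here is a rough userspace sketch (not
kernel code) of the sizing arithmetic. It assumes a leaf is an inner node
plus the map value. struct node_hdr and all helper names below are made up
for illustration and do not match the real struct lpm_trie_node, and the
sketch ignores any size-class rounding inside bpf_ma.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fixed header of a trie node.
 * The real struct lpm_trie_node has its own layout. */
struct node_hdr {
	struct node_hdr *child[2];
	unsigned int prefixlen;
	unsigned int flags;
};

/* Assumption: inner (intermediate) nodes carry prefix data but no value. */
static size_t im_node_size(size_t data_size)
{
	return sizeof(struct node_hdr) + data_size;
}

/* Assumption: leaf nodes carry prefix data followed by the map value. */
static size_t leaf_node_size(size_t data_size, size_t value_size)
{
	return im_node_size(data_size) + value_size;
}

/* With one allocator sized for leaves, every node (inner or leaf)
 * occupies one leaf-sized object. */
static size_t single_ma_obj_size(size_t data_size, size_t value_size)
{
	return leaf_node_size(data_size, value_size);
}

/* Bytes left unused when an inner node is served from the leaf-sized pool. */
static size_t im_node_waste(size_t data_size, size_t value_size)
{
	size_t obj = single_ma_obj_size(data_size, value_size);

	/* An inner node must always fit into a leaf-sized object. */
	assert(im_node_size(data_size) <= obj);
	return obj - im_node_size(data_size);
}
```

Under these assumptions, with 4 bytes of prefix data and an 8-byte value,
each inner node leaves 8 bytes of its object unused, i.e. the waste per
inner node is the map's value size.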