Re: [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie

Toke Høiland-Jørgensen <toke@xxxxxxxxxx> · Thu, 21 Nov 2024 13:50:49 +0100

Thomas Weißschuh <thomas.weissschuh@xxxxxxxxxxxxx> writes:

> On Thu, Nov 21, 2024 at 12:39:08PM +0100, Toke Høiland-Jørgensen wrote:
>> Hou Tao <houtao@xxxxxxxxxxxxxxx> writes:
>> 
>> > Fix these warnings by replacing kmalloc()/kfree()/kfree_rcu() with
>> > equivalent bpf memory allocator APIs. Since intermediate node and leaf
>> > node have fixed sizes, fixed-size allocation APIs are used.
>> >
>> > Two aspects of this change require explanation:
>> >
>> > 1. A new flag LPM_TREE_NODE_FLAG_ALLOC_LEAF is added to track the
>> >    original allocator. This is necessary because during deletion, a leaf
>> >    node may be used as an intermediate node. These nodes must be freed
>> >    through the leaf allocator.
>> > 2. The intermediate node allocator and leaf node allocator may be merged
>> >    because value_size for LPM trie is usually small. The merging reduces
>> >    the memory overhead of bpf memory allocator.
>> 
>> This seems like an awfully complicated way to fix this. Couldn't we just
>> move the node allocations in trie_update_elem() out so they happen
>> before the trie lock is taken instead?
>
> The problematic lock nesting is not between the trie lock and the
> allocator lock but between each of them and any other lock in the kernel.
> BPF programs can be called from any context through tracepoints.
> In this specific case the issue was a tracepoint executed under the
> workqueue lock.

That is not the issue described in the commit message, though. If the
goal is to make the lpm_trie map usable in any context, the commit
message should be rewritten to reflect this, instead of mentioning a
specific deadlock between the trie lock and the allocator lock.

And in that case, I think it's better to use a single 'struct
bpf_mem_alloc' per map (like hashmaps do). This will waste some memory
for intermediate nodes, but that seems like an acceptable trade-off to
avoid all the complexity around two different allocators.

Not sure if Alexei's comment about too many allocators was aimed solely
at this, or if he has issues even with having a single allocator per map
as well; but in that case, that seems like something that should be
fixed for hashmaps as well?

-Toke