Hi,

On 11/21/2024 8:50 PM, Toke Høiland-Jørgensen wrote:
> Thomas Weißschuh <thomas.weissschuh@xxxxxxxxxxxxx> writes:
>
>> On Thu, Nov 21, 2024 at 12:39:08PM +0100, Toke Høiland-Jørgensen wrote:
>>> Hou Tao <houtao@xxxxxxxxxxxxxxx> writes:
>>>
>>>> Fix these warnings by replacing kmalloc()/kfree()/kfree_rcu() with
>>>> equivalent bpf memory allocator APIs. Since intermediate node and leaf
>>>> node have fixed sizes, fixed-size allocation APIs are used.
>>>>
>>>> Two aspects of this change require explanation:
>>>>
>>>> 1. A new flag LPM_TREE_NODE_FLAG_ALLOC_LEAF is added to track the
>>>> original allocator. This is necessary because during deletion, a leaf
>>>> node may be used as an intermediate node. These nodes must be freed
>>>> through the leaf allocator.
>>>> 2. The intermediate node allocator and leaf node allocator may be merged
>>>> because value_size for LPM trie is usually small. The merging reduces
>>>> the memory overhead of bpf memory allocator.
>>>
>>> This seems like an awfully complicated way to fix this. Couldn't we just
>>> move the node allocations in trie_update_elem() out so they happen
>>> before the trie lock is taken instead?

I had considered that. However, it would allocate two nodes for each
update, which I think may be too expensive.

>> The problematic lock nesting is not between the trie lock and the
>> allocator lock but between each of them and any other lock in the kernel.
>> BPF programs can be called from any context through tracepoints.
>> In this specific case the issue was a tracepoint executed under the
>> workqueue lock.
>
> That is not the issue described in the commit message, though. If the
> goal is to make the lpm_trie map usable in any context, the commit
> message should be rewritten to reflect this, instead of mentioning a
> specific deadlock between the trie lock and the allocator lock.

The original intention of the patch set is to resolve the multiple
deadlock reports from syzbot [1]. All of these reports involve
trie->lock and the internal lock in kmalloc(). However, as Thomas
pointed out, fixing only the ordering between trie->lock and the
internal lock in kmalloc() is not enough. I will update the commit
message to reflect that.

[1]: https://lore.kernel.org/bpf/e14d8882-4760-7c9c-0cfc-db04eda494ee@xxxxxxxxxxxxxxx/

>
> And in that case, I think it's better to use a single 'struct
> bpf_mem_alloc' per map (like hashmaps do). This will waste some memory
> for intermediate nodes, but that seems like an acceptable trade-off to
> avoid all the complexity around two different allocators.
>
> Not sure if Alexei's comment about too many allocators was aimed solely
> at this, or if he has issues even with having a single allocator per map
> as well; but in that case, that seems like something that should be
> fixed for hashmaps as well?

In my understanding, the motivation for using a shared bpf_mem_alloc
instead of a per-map one is to reduce the memory overhead of the
per-CPU object freelists in each bpf memory allocator. The hash map is
a bit different: objects freed back to its bpf_mem_alloc may be reused
immediately, so if a freed object were reused by another call site
(e.g., one that is not a hash map) and its hlist_nulls_node were
overwritten, the hash map lookup procedure could oops (see the lookup
sketch at the end of this mail). That is why the hash map cannot
simply share one allocator with arbitrary other users.

>
> -Toke
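
To make point 1 of the quoted commit message more concrete, here is a
minimal sketch of the flag-based bookkeeping. The lpm_free_node()
helper and the node_ma/leaf_ma allocator fields are hypothetical names
for illustration only; the flag name comes from the patch, and
bpf_mem_cache_free_rcu() is the fixed-size bpf memory allocator API
that presumably replaces kfree_rcu():

static void lpm_free_node(struct lpm_trie *trie,
			  struct lpm_trie_node *node)
{
	/*
	 * A leaf that was repurposed as an intermediate node during
	 * deletion still carries the ALLOC_LEAF flag, so it is always
	 * returned to the allocator it was originally taken from.
	 */
	if (node->flags & LPM_TREE_NODE_FLAG_ALLOC_LEAF)
		bpf_mem_cache_free_rcu(&trie->leaf_ma, node);
	else
		bpf_mem_cache_free_rcu(&trie->node_ma, node);
}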
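
And here is a rough sketch of the hash map reuse concern mentioned
above, modeled on lookup_nulls_elem_raw() in kernel/bpf/hashtab.c (the
struct and function names are simplified here):

struct htab_elem {
	struct hlist_nulls_node hash_node;
	u32 hash;
	char key[];
};

static struct htab_elem *lookup_nulls(struct hlist_nulls_head *head,
				      u32 hash, void *key, u32 key_size,
				      u32 n_buckets)
{
	struct hlist_nulls_node *n;
	struct htab_elem *l;

again:
	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
		if (l->hash == hash && !memcmp(&l->key, key, key_size))
			return l;

	/*
	 * An element may be freed and reused while we walk the bucket.
	 * That is tolerated only because the reuser is another
	 * htab_elem which keeps hash_node intact: if the element moved
	 * to another bucket, the terminating nulls value will not match
	 * this bucket and the walk restarts. A non-hash-map user
	 * overwriting hash_node would instead make the RCU walk follow
	 * garbage.
	 */
	if (get_nulls_value(n) != (hash & (n_buckets - 1)))
		goto again;

	return NULL;
}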