Hi,

On 11/21/2024 8:50 PM, Toke Høiland-Jørgensen wrote:
> Thomas Weißschuh <thomas.weissschuh@xxxxxxxxxxxxx> writes:
>
>> On Thu, Nov 21, 2024 at 12:39:08PM +0100, Toke Høiland-Jørgensen wrote:
>>> Hou Tao <houtao@xxxxxxxxxxxxxxx> writes:
>>>
>>>> Fix these warnings by replacing kmalloc()/kfree()/kfree_rcu() with
>>>> equivalent bpf memory allocator APIs. Since intermediate node and leaf
>>>> node have fixed sizes, fixed-size allocation APIs are used.
>>>>
>>>> Two aspects of this change require explanation:
>>>>
>>>> 1. A new flag LPM_TREE_NODE_FLAG_ALLOC_LEAF is added to track the
>>>> original allocator. This is necessary because during deletion, a leaf
>>>> node may be used as an intermediate node. These nodes must be freed
>>>> through the leaf allocator.
>>>> 2. The intermediate node allocator and leaf node allocator may be merged
>>>> because value_size for LPM trie is usually small. The merging reduces
>>>> the memory overhead of bpf memory allocator.
>>>
>>> This seems like an awfully complicated way to fix this. Couldn't we just
>>> move the node allocations in trie_update_elem() out so they happen
>>> before the trie lock is taken instead?

I had considered that. However, it would allocate two nodes for each
update, which I think may be too expensive.

>> The problematic lock nesting is not between the trie lock and the
>> allocator lock but between each of them and any other lock in the kernel.
>> BPF programs can be called from any context through tracepoints.
>> In this specific case the issue was a tracepoint executed under the
>> workqueue lock.
>
> That is not the issue described in the commit message, though. If the
> goal is to make the lpm_trie map usable in any context, the commit
> message should be rewritten to reflect this, instead of mentioning a
> specific deadlock between the trie lock and the allocator lock.

The original intention of the patch set is to resolve the multiple
deadlock reports from syzbot [1]. All of these reports involve
trie->lock and the internal lock in kmalloc(). However, as Thomas
pointed out, fixing only the ordering between trie->lock and the
internal lock in kmalloc() is not enough. I will update the commit
message to reflect that.

[1]: https://lore.kernel.org/bpf/e14d8882-4760-7c9c-0cfc-db04eda494ee@xxxxxxxxxxxxxxx/

>
> And in that case, I think it's better to use a single 'struct
> bpf_mem_alloc' per map (like hashmaps do). This will waste some memory
> for intermediate nodes, but that seems like an acceptable trade-off to
> avoid all the complexity around two different allocators.
>
> Not sure if Alexei's comment about too many allocators was aimed solely
> at this, or if he has issues even with having a single allocator per map
> as well; but in that case, that seems like something that should be
> fixed for hashmaps as well?

In my understanding, the motivation for using a shared bpf_mem_alloc
instead of a per-map one is to reduce the memory overhead of the
per-CPU object freelists in each bpf memory allocator. The hash map is
a bit different: objects freed back to its bpf_mem_alloc may be reused
immediately, so if a freed object were reused by another call site
(e.g., one that is not a hash map) and its hlist_nulls_node were
overwritten, the hash map lookup procedure could oops (see the lookup
sketch at the end of this mail). That is why the hash map cannot
simply share one allocator with arbitrary other users.

>
> -Toke
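
To make point 1 of the quoted commit message more concrete, here is a
minimal sketch of the flag-based bookkeeping. The lpm_free_node()
helper and the node_ma/leaf_ma allocator fields are hypothetical names
for illustration only; the flag name comes from the patch, and
bpf_mem_cache_free_rcu() is the fixed-size bpf memory allocator API
that presumably replaces kfree_rcu():

static void lpm_free_node(struct lpm_trie *trie,
			  struct lpm_trie_node *node)
{
	/*
	 * A leaf that was repurposed as an intermediate node during
	 * deletion still carries the ALLOC_LEAF flag, so it is always
	 * returned to the allocator it was originally taken from.
	 */
	if (node->flags & LPM_TREE_NODE_FLAG_ALLOC_LEAF)
		bpf_mem_cache_free_rcu(&trie->leaf_ma, node);
	else
		bpf_mem_cache_free_rcu(&trie->node_ma, node);
}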
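
And here is a rough sketch of the hash map reuse concern mentioned
above, modeled on lookup_nulls_elem_raw() in kernel/bpf/hashtab.c (the
struct and function names are simplified here):

struct htab_elem {
	struct hlist_nulls_node hash_node;
	u32 hash;
	char key[];
};

static struct htab_elem *lookup_nulls(struct hlist_nulls_head *head,
				      u32 hash, void *key, u32 key_size,
				      u32 n_buckets)
{
	struct hlist_nulls_node *n;
	struct htab_elem *l;

again:
	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
		if (l->hash == hash && !memcmp(&l->key, key, key_size))
			return l;

	/*
	 * An element may be freed and reused while we walk the bucket.
	 * That is tolerated only because the reuser is another
	 * htab_elem which keeps hash_node intact: if the element moved
	 * to another bucket, the terminating nulls value will not match
	 * this bucket and the walk restarts. A non-hash-map user
	 * overwriting hash_node would instead make the RCU walk follow
	 * garbage.
	 */
	if (get_nulls_value(n) != (hash & (n_buckets - 1)))
		goto again;

	return NULL;
}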