From: Hou Tao <houtao1@xxxxxxxxxx>

Hi,

The patchset tries to fix the problems found when checking how htab map
handles element reuse in the bpf memory allocator. The immediate reuse of
freed elements may lead to two problems in htab map:

(1) Reuse reinitializes special fields (e.g., bpf_spin_lock) in the htab
    map value, which may corrupt a lookup done with the BPF_F_LOCK flag,
    because that lookup acquires the bpf-spin-lock while copying the
    value. Corrupting the bpf-spin-lock may result in a hard lock-up.

(2) The lookup procedure may return an incorrect map value if the found
    element is freed and then reused.

Because all elements of an htab map have the same type, problem #1 can be
fixed by supporting a ctor in the bpf memory allocator. The ctor
initializes these special fields only when the map element is newly
allocated. If the element is just being reused, there is no
reinitialization.

Problem #2 exists for both non-preallocated and preallocated htab maps.
Adding a seq to the htab element, checking for reuse, and retrying the
lookup may be a feasible solution, but it would make the lookup API hard
to use, because the user would need to check whether the found element
has been reused and repeat the lookup if it has. A simpler solution is to
disable the reuse of freed elements and to free these elements only after
the lookup procedure ends. To reduce the overhead of calling
call_rcu_tasks_trace() for each freed element, these elements are freed
in batches: freed elements are first moved onto a global per-cpu free
list; once the number of freed elements reaches the threshold, they are
moved into a dynamically allocated object and freed by a global per-cpu
worker by calling call_rcu_tasks_trace().

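The batching scheme described above can be sketched as a single-threaded
userspace model. This is only an illustration, not the code in the
patchset: every toy_* name and the TOY_BATCH_THRESHOLD value are
hypothetical, there is no real RCU here, and the grace period is
simulated by an explicit call to toy_grace_period_elapsed().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical threshold; the cover letter does not give a value. */
#define TOY_BATCH_THRESHOLD 4

struct toy_elem {
	struct toy_elem *next;
};

/* One batch of elements waiting for a (simulated) grace period. */
struct toy_batch {
	struct toy_batch *next;
	struct toy_elem *to_free;
};

/* Stand-in for the per-cpu state; a real allocator keeps one per CPU. */
struct toy_cpu_cache {
	struct toy_elem *free_list;	/* freed, not yet batched */
	unsigned int free_cnt;
	struct toy_batch *pending;	/* batches waiting for a grace period */
	unsigned int freed_total;	/* elements actually released */
};

/*
 * Stand-in for the worker: splice the whole free list into a newly
 * allocated batch object. Freeing memory therefore needs memory; on
 * failure the elements simply stay on the free list in this model.
 */
static int toy_queue_batch(struct toy_cpu_cache *c)
{
	struct toy_batch *b = malloc(sizeof(*b));

	if (!b)
		return -1;
	b->to_free = c->free_list;
	b->next = c->pending;
	c->pending = b;
	c->free_list = NULL;
	c->free_cnt = 0;
	return 0;
}

/* Free without reuse: park the element until the threshold is reached. */
static void toy_free_no_reuse(struct toy_cpu_cache *c, struct toy_elem *e)
{
	e->next = c->free_list;
	c->free_list = e;
	if (++c->free_cnt >= TOY_BATCH_THRESHOLD)
		toy_queue_batch(c);
}

/* Stand-in for the callback that runs once the grace period has ended. */
static void toy_grace_period_elapsed(struct toy_cpu_cache *c)
{
	while (c->pending) {
		struct toy_batch *b = c->pending;

		c->pending = b->next;
		while (b->to_free) {
			struct toy_elem *e = b->to_free;

			b->to_free = e->next;
			free(e);
			c->freed_total++;
		}
		free(b);
	}
}
```

In this model, a freed element is never handed back to an allocation
path, so a concurrent lookup cannot observe it being reused. The cost is
that memory is returned only after a batch has been queued and the
simulated grace period has elapsed.
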
Because the solution frees memory by allocating new memory, the global
per-cpu worker handles allocation failure by calling
rcu_barrier_tasks_trace() to wait for the RCU grace period to expire and
then freeing the elements that have been spliced onto a temporary list.
Elements freed in the meantime are freed after another round of
rcu_barrier_tasks_trace() if there is still no memory. It may be
necessary to reserve some bpf_ma_free_batch objects to speed up the free.
The patchset also does not yet consider the case where the RCU grace
period is slow: because the newly allocated memory (aka
bpf_ma_free_batch) is freed only after the RCU grace period expires, a
slow grace period may cause too many bpf_ma_free_batch objects to be
allocated.

After applying BPF_MA_NO_REUSE in htab map, the performance of
"./map_perf_test 4 18 8192" drops from 520K to 330K events per sec on one
CPU. This is a big performance degradation, so I hope to get some
feedback on whether the change is necessary and on how to better fix the
reuse problem in htab map (globally allocated objects may have the same
problems as htab map). Comments are always welcome.

Regards,
Hou

Hou Tao (6):
  bpf: Support ctor in bpf memory allocator
  bpf: Factor out a common helper free_llist()
  bpf: Pass bitwise flags to bpf_mem_alloc_init()
  bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator
  bpf: Use BPF_MA_NO_REUSE in htab map
  selftests/bpf: Add test case for element reuse in htab map

 include/linux/bpf_mem_alloc.h                 |  12 +-
 kernel/bpf/core.c                             |   2 +-
 kernel/bpf/hashtab.c                          |  17 +-
 kernel/bpf/memalloc.c                         | 218 ++++++++++++++++--
 .../selftests/bpf/prog_tests/htab_reuse.c     | 111 +++++++++
 .../testing/selftests/bpf/progs/htab_reuse.c  |  19 ++
 6 files changed, 353 insertions(+), 26 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/htab_reuse.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_reuse.c

-- 
2.29.2