Hi,

On 1/7/2025 8:09 PM, Hou Tao wrote:
>
> On 1/7/2025 4:55 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>
>> Hi,
>>
>> The patch set continues the previous work [1] to move all the freeing
>> of htab elements out of the bucket lock. One motivation for the patch
>> set is the locking problem reported by Sebastian [2]: the freeing of
>> bpf_timer under PREEMPT_RT may acquire a spin-lock (namely
>> softirq_expiry_lock). However, the freeing procedure for a htab
>> element already holds a raw-spin-lock (namely the bucket lock), so it
>> triggers the warning "BUG: scheduling while atomic", as demonstrated
>> by the selftests patch. Another motivation is to reduce the locked
>> scope of the bucket lock.
>>
>> The patch set is structured as follows:
>>
>> * Patch #1 moves the element freeing out of lock for
>>   htab_lru_map_delete_node()
>> * Patch #2~#3 move the element freeing out of lock for
>>   __htab_map_lookup_and_delete_elem()
>> * Patch #4~#6 move the element freeing out of lock for
>>   htab_map_update_elem()
>> * Patch #7 adds a selftest for the locking problem
>>
>> The changes for htab_map_update_elem() require some explanation. The
>> reason the previous work [1] could not move the element freeing out of
>> the bucket lock for preallocated hash tables is the ->extra_elems
>> optimization: when alloc_htab_elem() returns, the existing old element
>> has already been stashed in the per-cpu ->extra_elems. To handle that,
>> patches #4~#6 break the reuse of ->extra_elems and the refill of
>> ->extra_elems into two independent steps: do the reuse with the bucket
>> lock held, and do the refill after unlocking the bucket lock. The
>> downside is that concurrent updates on the same CPU may need to pop a
>> free element from the per-cpu list instead of reusing ->extra_elems
>> directly, but I think such cases will be rare.
> Er, the break of reuse and refill of ->extra_elems is buggy. It
> occasionally failed the htab_update/concurrent_update test in BPF CI.
> It may also return an unexpected E2BIG error when the map is full and
> there are concurrent overwrites on the same CPU. Need to figure out
> another way to handle the lock problem.

Another approach would be to ensure that the reuse of extra_elems and
the freeing of the special fields in the extra_elems run sequentially.
However, considering that the reuse of extra_elems may happen in IRQ
context, I still cannot find a better way to handle the lock problem
along those lines. Instead, I have decided to change the implementation
of bpf_timer_delete_work() to fix the lock problem: it will call
hrtimer_try_to_cancel() first, and if that returns -1,
bpf_timer_delete_work() will queue a work item to try cancelling the
timer again.

@Sebastian Is it possible to change softirq_expiry_lock to a
raw-spin-lock instead?
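
For reference, the bpf_timer_delete_work() change described above would
look roughly like the sketch below. It is untested and for discussion
only; the bpf_hrtimer layout and the queue_work() call follow my reading
of the current kernel/bpf/helpers.c, and the retry could also live in
bpf_timer_cancel_and_free() instead:

/* Untested sketch for discussion, not the final patch. */
static void bpf_timer_delete_work(struct work_struct *work)
{
	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer,
					     cb.delete_work);

	/* hrtimer_try_to_cancel() never waits for a running callback, so
	 * it does not take softirq_expiry_lock on PREEMPT_RT. It returns
	 * -1 if the timer callback is currently running.
	 */
	if (hrtimer_try_to_cancel(&t->timer) >= 0) {
		kfree_rcu(t, cb.rcu);
		return;
	}

	/* The callback is still running on another CPU. Instead of
	 * waiting for it like hrtimer_cancel() does, re-queue the work
	 * and retry the cancellation later.
	 */
	queue_work(system_unbound_wq, &t->cb.delete_work);
}

The retry may run a few times while the callback finishes, but it never
blocks on softirq_expiry_lock.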

>> Please see individual patches for more details. Comments are always
>> welcome.
>>
>> [1]: https://lore.kernel.org/bpf/20241106063542.357743-1-houtao@xxxxxxxxxxxxxxx
>> [2]: https://lore.kernel.org/bpf/20241106084527.4gPrMnHt@xxxxxxxxxxxxx
>>
>> Hou Tao (7):
>>   bpf: Free special fields after unlock in htab_lru_map_delete_node()
>>   bpf: Bail out early in __htab_map_lookup_and_delete_elem()
>>   bpf: Free element after unlock in __htab_map_lookup_and_delete_elem()
>>   bpf: Support refilling extra_elems in free_htab_elem()
>>   bpf: Factor out the element allocation for pre-allocated htab
>>   bpf: Free element after unlock for pre-allocated htab
>>   selftests/bpf: Add test case for the freeing of bpf_timer
>>
>>  kernel/bpf/hashtab.c                           | 170 ++++++++++--------
>>  .../selftests/bpf/prog_tests/free_timer.c      | 165 +++++++++++++++++
>>  .../testing/selftests/bpf/progs/free_timer.c   |  71 ++++++++
>>  3 files changed, 332 insertions(+), 74 deletions(-)
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/free_timer.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/free_timer.c
>>