Hi,

On 1/7/2025 8:09 PM, Hou Tao wrote:
>
> On 1/7/2025 4:55 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>
>> Hi,
>>
>> The patch set continues the previous work [1] to move all the freeing
>> of htab elements out of the bucket lock. One motivation for the patch
>> set is the locking problem reported by Sebastian [2]: the freeing of
>> bpf_timer under PREEMPT_RT may acquire a spin-lock (namely
>> softirq_expiry_lock). However, the freeing procedure for a htab
>> element already holds a raw-spin-lock (namely the bucket lock), so it
>> triggers the warning "BUG: scheduling while atomic", as demonstrated
>> by the selftests patch. Another motivation is to reduce the locked
>> scope of the bucket lock.
>>
>> The patch set is structured as follows:
>>
>> * Patch #1 moves the element freeing out of lock for
>>   htab_lru_map_delete_node()
>> * Patch #2~#3 move the element freeing out of lock for
>>   __htab_map_lookup_and_delete_elem()
>> * Patch #4~#6 move the element freeing out of lock for
>>   htab_map_update_elem()
>> * Patch #7 adds a selftest for the locking problem
>>
>> The changes for htab_map_update_elem() require some explanation. The
>> reason the previous work [1] could not move the element freeing out of
>> the bucket lock for preallocated hash tables is the ->extra_elems
>> optimization: when alloc_htab_elem() returns, the existing old element
>> has already been stashed in the per-cpu ->extra_elems. To handle that,
>> patches #4~#6 break the reuse of ->extra_elems and the refill of
>> ->extra_elems into two independent steps: do the reuse with the bucket
>> lock held, and do the refill after unlocking the bucket lock. The
>> downside is that concurrent updates on the same CPU may need to pop a
>> free element from the per-cpu list instead of reusing ->extra_elems
>> directly, but I think such cases will be rare.
> Er, the break of reuse and refill of ->extra_elems is buggy. It
> occasionally failed the htab_update/concurrent_update test in BPF CI.
> It may also return an unexpected E2BIG error when the map is full and
> there are concurrent overwrites on the same CPU. Need to figure out
> another way to handle the lock problem.

Another approach would be to ensure that the reuse of extra_elems and
the freeing of the special fields in the extra_elems run sequentially.
However, considering that the reuse of extra_elems may happen in IRQ
context, I still cannot find a better way to handle the lock problem
along those lines. Instead, I have decided to change the implementation
of bpf_timer_delete_work() to fix the lock problem: it will call
hrtimer_try_to_cancel() first, and if that returns -1,
bpf_timer_delete_work() will queue a work item to try cancelling the
timer again.

@Sebastian Is it possible to change softirq_expiry_lock to a
raw-spin-lock instead?
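
For reference, the bpf_timer_delete_work() change described above would
look roughly like the sketch below. It is untested and for discussion
only; the bpf_hrtimer layout and the queue_work() call follow my reading
of the current kernel/bpf/helpers.c, and the retry could also live in
bpf_timer_cancel_and_free() instead:

/* Untested sketch for discussion, not the final patch. */
static void bpf_timer_delete_work(struct work_struct *work)
{
	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer,
					     cb.delete_work);

	/* hrtimer_try_to_cancel() never waits for a running callback, so
	 * it does not take softirq_expiry_lock on PREEMPT_RT. It returns
	 * -1 if the timer callback is currently running.
	 */
	if (hrtimer_try_to_cancel(&t->timer) >= 0) {
		kfree_rcu(t, cb.rcu);
		return;
	}

	/* The callback is still running on another CPU. Instead of
	 * waiting for it like hrtimer_cancel() does, re-queue the work
	 * and retry the cancellation later.
	 */
	queue_work(system_unbound_wq, &t->cb.delete_work);
}

The retry may run a few times while the callback finishes, but it never
blocks on softirq_expiry_lock.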

>> Please see individual patches for more details. Comments are always
>> welcome.
>>
>> [1]: https://lore.kernel.org/bpf/20241106063542.357743-1-houtao@xxxxxxxxxxxxxxx
>> [2]: https://lore.kernel.org/bpf/20241106084527.4gPrMnHt@xxxxxxxxxxxxx
>>
>> Hou Tao (7):
>>   bpf: Free special fields after unlock in htab_lru_map_delete_node()
>>   bpf: Bail out early in __htab_map_lookup_and_delete_elem()
>>   bpf: Free element after unlock in __htab_map_lookup_and_delete_elem()
>>   bpf: Support refilling extra_elems in free_htab_elem()
>>   bpf: Factor out the element allocation for pre-allocated htab
>>   bpf: Free element after unlock for pre-allocated htab
>>   selftests/bpf: Add test case for the freeing of bpf_timer
>>
>>  kernel/bpf/hashtab.c                           | 170 ++++++++++--------
>>  .../selftests/bpf/prog_tests/free_timer.c      | 165 +++++++++++++++++
>>  .../testing/selftests/bpf/progs/free_timer.c   |  71 ++++++++
>>  3 files changed, 332 insertions(+), 74 deletions(-)
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/free_timer.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/free_timer.c
>>