Hi,

On 12/7/2022 9:52 AM, Yonghong Song wrote:
>
>
> On 12/5/22 8:29 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>
>> When there are batched freeing operations on a specific CPU, part of
>> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
>> be moved to waiting_for_gp list and the remaining part will be left in
>> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
>> period and the next invocation of free_bulk().
>
> The change below LGTM. However, the above description seems not precise.
> IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
> call_rcu_in_progress is true or not. If it is true, free_by_rcu list
> will remain intact and not moving into waiting_for_gp list.
> So it is not 'the remaining part will be left in free_by_rcu'.
> It is all elements in free_by_rcu to waiting_for_gp or none.

Thanks for the review and the suggestions. What I tried to say is that moving
elements from free_by_rcu to waiting_for_gp is slow, so many free elements can
pile up on the free_by_rcu list. How about the following rephrasing, or do you
still prefer "It is all elements in free_by_rcu to waiting_for_gp or none."?

  When there are batched freeing operations on a specific CPU, part of the
  freed elements ((high_watermark - low_watermark) / 2 + 1) will be moved to
  the waiting_for_gp list and the remaining part will be left on the
  free_by_rcu list. The elements on the free_by_rcu list are only moved to
  the waiting_for_gp list after one RCU-tasks-trace grace period and another
  invocation of free_bulk(), so many free elements may pile up on the
  free_by_rcu list.

>>
>> So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
>> allocate a new object, in alloc_bulk() just check whether or not there
>> is freed element in free_by_rcu and reuse it if available.
>>
>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>
> LGTM except the above suggestions.
>
> Acked-by: Yonghong Song <yhs@xxxxxx>
>
>> ---
>>   kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
>>   1 file changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 8f0d65f2474a..7daf147bc8f6 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>       memcg = get_memcg(c);
>>       old_memcg = set_active_memcg(memcg);
>>       for (i = 0; i < cnt; i++) {
>> -        obj = __alloc(c, node);
>> -        if (!obj)
>> -            break;
>> +        /*
>> +         * free_by_rcu is only manipulated by irq work refill_work().
>> +         * IRQ works on the same CPU are called sequentially, so it is
>> +         * safe to use __llist_del_first() here. If alloc_bulk() is
>> +         * invoked by the initial prefill, there will be no running
>> +         * irq work, so __llist_del_first() is fine as well.
>> +         *
>> +         * In most cases, objects on free_by_rcu are from the same CPU.
>> +         * If some objects come from other CPUs, it doesn't incur any
>> +         * harm because NUMA_NO_NODE means the preference for current
>> +         * numa node and it is not a guarantee.
>> +         */
>> +        obj = __llist_del_first(&c->free_by_rcu);
>> +        if (!obj) {
>> +            obj = __alloc(c, node);
>> +            if (!obj)
>> +                break;
>> +        }
>>           if (IS_ENABLED(CONFIG_PREEMPT_RT))
>>               /* In RT irq_work runs in per-cpu kthread, so disable
>>                * interrupts to avoid preemption and interrupts and
>
> .
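
For context, here is a simplified sketch of the free path that the commit
message describes, based on my reading of kernel/bpf/memalloc.c around this
series rather than the verbatim upstream code. The helper names
(do_call_rcu(), enque_to_free(), __free_rcu_tasks_trace()) follow that file
but may differ slightly from the exact tree this patch targets. The point it
illustrates is the one raised above: free_bulk() pushes freed objects onto
free_by_rcu, and the move to waiting_for_gp is all-or-nothing, gated by
call_rcu_in_progress, so objects can sit on free_by_rcu for a while.

/* Sketch: free_bulk() has just pushed the freed objects onto
 * c->free_by_rcu via enque_to_free() and now calls do_call_rcu().
 */
static void do_call_rcu(struct bpf_mem_cache *c)
{
	struct llist_node *llnode, *t;

	/* An earlier RCU-tasks-trace grace period is still in flight:
	 * leave everything on free_by_rcu. The next free_bulk() will try
	 * again, so elements can pile up here in the meantime.
	 */
	if (atomic_xchg(&c->call_rcu_in_progress, 1))
		return;

	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
	/* Move all of free_by_rcu to waiting_for_gp, then start a single
	 * RCU-tasks-trace grace period for the whole batch.
	 */
	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
		__llist_add(llnode, &c->waiting_for_gp);
	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
}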