Re: [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP

Hou Tao <houtao@xxxxxxxxxxxxxxx> · Fri, 28 Apr 2023 10:24:16 +0800

Hi Alexei,

On 4/27/2023 12:24 PM, Alexei Starovoitov wrote:
> On Sun, Apr 23, 2023 at 03:41:05PM +0800, Hou Tao wrote:
>>>> (3) reuse-after-rcu-gp bpf memory allocator
>>> that's the one you're implementing below, right?
>> Right.
>>>> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
>>>> | --                  | --         | --                   | --                |
>>>> | no_op               | 1276       | 0.96                 | 1.00              |
>>>> | overwrite           | 15.66      | 25.00                | 33.07             |
>>>> | batch_add_batch_del | 10.32      | 18.84                | 22.64             |
>>>> | add_del_on_diff_cpu | 13.00      | 550.50               | 748.74            |
>>>>
>>>> (4) free-after-rcu-gp bpf memory allocator (free directly through call_rcu)
>>> What do you mean? htab uses bpf_ma, but does call_rcu before doing bpf_mem_free ?
>> No, there is no call_rcu() before bpf_mem_free(). bpf_mem_free() in the
>> free-after-rcu-gp flavor does call_rcu() in batches to free these elements
>> back to the slab subsystem directly. The elements in this flavor of bpf_ma are
>> not safe to access from sleepable programs unless
>> bpf_rcu_read_{lock,unlock}() are used.
>>
>> But I think using call_rcu() to call bpf_mem_free() is a good candidate for
>> comparison, and I saw that bpf_cpumask does that, so I modified the bpf hash
>> table to do a similar thing and pasted the benchmark results. As we can see
>> from the results, the memory usage of this flavor is much bigger than that of
>> reuse-after-rcu-gp and free-after-rcu-gp:
> I don't follow what exactly you're doing and what you're measuring.
> Please provide patches for both reuse-after-rcu-gp and free-after-rcu-gp to
> have a meaningful conversation.
OK. I will add a new FREE_AFTER_RCU_GP flavor of the bpf memory allocator in v3.
> Right now we're stuck at what the bench tool is actually measuring.
>
>>>> +		if (try_queue_work && !work_pending(&c->reuse_work)) {
>>>> +			/* Use reuse_cb_in_progress to indicate there is
>>>> +			 * inflight reuse kworker or reuse RCU callback.
>>>> +			 */
>>>> +			atomic_inc(&c->reuse_cb_in_progress);
>>>> +			/* Already queued */
>>>> +			if (!queue_work(bpf_ma_wq, &c->reuse_work))
>>> how many kthreads are spawned by wq in the peak?
>> I think it depends on the number of bpf_ma. Because bpf_ma_wq is a per-CPU
>> workqueue, there is at most one worker per CPU for each bpf_ma. The current
>> limit on the number of active workers per CPU is 256, but it is
>> customizable through the alloc_workqueue() API.
> Which means that on an 8-CPU system there will be 8 * 256 kthreads?
> That's a lot. Please provide num_of_all_threads before/after/at_peak during bench.
Yes, 8 * 256 is a lot, but there are at most 8 kworkers during the
benchmark, because only one bpf_memory_allocator is used.
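
For the before/after/at_peak thread numbers requested above, one possible way
to sample them from userspace is sketched below. It is not part of the patch
set and the helper names are hypothetical. It assumes the documented
/proc/loadavg layout, where the fourth field is "runnable/total" scheduling
entities, so the total includes kernel threads such as kworkers along with
every other thread on the system.

```c
#include <stdio.h>

/* Hypothetical helper: parse one /proc/loadavg line and return the total
 * number of scheduling entities (the part after '/' in the 4th field),
 * or -1 if the line does not match the expected layout.
 */
long parse_loadavg_total(const char *line)
{
	double load1, load5, load15;
	long runnable, total;

	if (sscanf(line, "%lf %lf %lf %ld/%ld",
		   &load1, &load5, &load15, &runnable, &total) != 5)
		return -1;
	return total;
}

/* Hypothetical helper: read /proc/loadavg once and return the current
 * system-wide thread count, or -1 on any error.
 */
long read_total_threads(void)
{
	char buf[128];
	long total = -1;
	FILE *f = fopen("/proc/loadavg", "r");

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		total = parse_loadavg_total(buf);
	fclose(f);
	return total;
}
```

Calling read_total_threads() before the benchmark, periodically while it runs
(keeping the maximum), and after it exits would give the three numbers. The
count is system-wide, so only the difference from the "before" sample says
anything about kworkers created by the benchmark.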
>
> Pls trim your replies. Mailers like mutt have a hard time navigating.
Do you mean the email content didn't wrap automatically, or that the wrap
length is too long (my current setting is 80)?