Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On 6/8/2023 7:23 AM, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>> <alexei.starovoitov@xxxxxxxxx> wrote:
>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>> As said in the commit message, the command line for test is
>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>> using default max_entries and the number of CPUs is greater than 15,
>>>> use_percpu_counter will be false.
>>> Right. percpu or not depends on number of cpus.
>>>
>>>> I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
>>>> test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
>>>> there are obvious performance degradation.
>>> ...
>>>> [root@hello bpf]# ./map_perf_test 4 8 16384
>>>> 2:hash_map_perf kmalloc 359201 events per sec
>>> ..
>>>> [root@hello bpf]# ./map_perf_test 4 8 16384
>>>> 4:hash_map_perf kmalloc 203983 events per sec
>>> this is indeed a degration in a VM.
>>>
>>>> I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
>>>> performances for "./map_perf_test 4 8" are similar, but there is obvious
>>>> performance degradation for "./map_perf_test 4 8 16384"
>>> but... a degradation?
>>>
>>>> Before reuse-after-rcu-gp:
>>>>
>>>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
>>>> 1:hash_map_perf kmalloc 388088 events per sec
>>> ...
>>>> After reuse-after-rcu-gp:
>>>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
>>>> 5:hash_map_perf kmalloc 655628 events per sec
>>> This is a big improvement :) Not a degration.
>>> You always have to double check the numbers with perf report.
>>>
>>>> So could you please double check your setup and rerun map_perf_test ? If
>>>> there is no performance degradation, could you please share your setup
>>>> and your kernel configure file ?
>>> I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
>>> Playing with it a bit more I found something interesting:
>>> map_perf_test 4 8 16348
>>> before/after has too much noise to be conclusive.
>>>
>>> So I did
>>> map_perf_test 4 8 16348 1000000
>>>
>>> and now I see significant degration from patch 3.
>>> It drops from 800k to 200k.
>>> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
>>> The following hack addresses most of the perf degradtion:
>>>
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index fea1cb0c78bb..eeadc9359097 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>>>         alloc = 0;
>>>         head = NULL;
>>>         tail = NULL;
>>> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
>>> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
>>>         while (alloc < cnt) {
>>>                 obj = __llist_del_first(&sc->reuse_ready_head);
>>>                 if (obj) {
>>> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>>>                 alloc++;
>>>         }
>>>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>> +       }
>>>
>>>         if (alloc) {
>>>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
>>> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
>>>                 sc->reuse_ready_tail = NULL;
>>>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
>>>                 __llist_add_batch(head, tail, &sc->wait_for_free);
>>> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
>>> +       } else {
>>> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>>         }
>>> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>>  }
>>>
>>> It now drops from 800k to 450k.
>>> And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
>>> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
>> Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
>> I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> An update..
>
> I tweaked patch 3 to do per-cpu reuse_ready and it addressed
> the lock contention, but cache miss on
> __llist_del_first(&c->reuse_ready_head);
> was still very high and performance was still at 450k as
> with a simple hack above.
>
> Then I removed some of the _tail optimizations and added counters
> to these llists.
> To my surprise
> map_perf_test 4 1 16348 1000000
> was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
> and ~400k sitting in reuse_ready_head.
Yep. If you use htab-mem-bechmark in patch #2, you will find the same
results, the same long single lists and the same huge memory usage.
>
> Then noticed that we should be doing:
> call_rcu_hurry(&c->rcu, reuse_rcu);
> instead of call_rcu(),
> but my config didn't have RCU_LAZY, so that didn't help.
> Obviously we cannot allow such a huge number of elements to sit
> in these link lists.
> The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
I think the main blocker is the huge memory usage (it is the same thing
as the long wait_for_reuse and wait_for_free list), right ?
> To unblock qp-trie work I suggest to add rcu_head to each inner node
> and do call_rcu() on them before free-ing them to bpf_mem_alloc.
> Explicit call_rcu would disqualify qp-tree from tracing programs though :(




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux