Hi Paul,

On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>>> <alexei.starovoitov@xxxxxxxxx> wrote:
>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>>> As said in the commit message, the command line for the test is
>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>>> using the default max_entries and the number of CPUs is greater than 15,
>>>>> use_percpu_counter will be false.
>>>> Right. percpu or not depends on number of cpus.
SNIP
>>>> known, because I had just proposed it in the email yesterday.
>>> Also noticed that the overhead of the shared reuse_ready list
>>> comes both from the contended lock and from cache misses
>>> when one cpu pushes to the list after the RCU GP and another
>>> cpu removes.
>>>
>>> Also the low/batch/high watermarks are all wrong in patch 3.
>>> low=32 and high=96 make no sense when it's not a single list.
>>> I'm experimenting with 32 for all three heuristics.
>>>
>>> Another thing I noticed is that per-cpu prepare_reuse and free_by_rcu
>>> are redundant.
>>> unit_free() can push into free_by_rcu directly,
>>> then reuse_bulk() can fill it up with free_llist_extra and
>>> move them into waiting_for_gp.
>> Yes. Indeed, I missed that.
>>> All these _tail optimizations are obscuring the code and make it hard
>>> to notice these issues.
>>>
>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>>>>
>>>> Answering your other email:
>>>>
>>>>> I see your point. I will continue to debug the memory usage difference
>>>>> between v3 and v4.
>>>> imo it's a waste of time to continue analyzing performance based on the bench in patch 2.
>> Don't agree with that. I still think the potential memory usage of v4 is
>> a problem, and the different memory usage between v3 and v4 reveals that
>> there is some peculiarity in the RCU subsystem, because the difference comes
>> from the duration of the RCU grace period. We need to find out why and fix
>> or work around that.
> A tight loop in the kernel can extend RCU grace periods, especially
> for kernels built with CONFIG_PREEMPTION=n. Placing things like
> cond_resched() in such loops can help. Of course, if you are in an
> RCU read-side critical section (or holding a spinlock), you will need
> to exit, cond_resched(), then re-enter, taking care to ensure that the
> state upon re-entry is valid. After all, having exited either type of
> critical section, anything might happen.

As said in the help-wanted email I just sent out, it is weird that the
RCU grace period is extended a lot, up to ~150ms or more, but
periodically queueing a dummy kworker which does nothing helps reduce
the grace period to ~10ms. I have explained the context of the problem
in that email, and I hope we can get some help from you and the RCU
experts in the community. A rough sketch of what I mean by a dummy
kworker is appended below the quoted text.

Regards,
Tao

>
> 							Thanx, Paul
>
>>>>> I don't think so. Let's consider the per-cpu list first. Assume the
>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
>>>>> of a normal SSD. The iops is about 176K per second, so before one RCU
>>>>> GP has passed, we will need to allocate about 176 * 30 = 5.2K elements.
>>>>> For the per-ma list, when the number of CPUs increases, it is easy to
>>>>> make the list contain thousands of elements.
>>>> That would be true only if there were no scheduling events in all of 176K ops.
>>>> Which is not the case.
>>>> I'm not sure why you're saying that RCU GP is 30ms.
>> Because these freed elements will be freed after one RCU GP, and in v4
>> there is only one RCU callback per-cpu, so before one RCU GP has expired,
>> these freed elements will be accumulated on the list.
>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
>>>> Every sched event is sort-of an implicit rcu_read_lock/unlock.
>>>> Network and block IO don't process 176K packets without resched.
>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>>>>
>>>> For small size buckets low_watermark=32 and high=96.
>>>> We typically move 32 elements at a time from one list to another.
>>>> A bunch of elements may be sitting in free_by_rcu, and moving them to waiting_for_gp
>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
>>>> elements from waiting_for_gp one at a time and kfree them one at a time.
>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>>>>
>>>>> Before I post v5, I want to know the reason why the per-bpf-ma list is
>>>>> introduced. Previously, I thought it was used to handle the case in which
>>>>> allocation and freeing are done on different CPUs.
>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>>>>
>>>>> And as we can see
>>>>> from the benchmark result above and in v3, the performance and the
>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
>>>>
>>>> So far the only bench we can trust and analyze is map_perf_test.
>>>> Please make the bench in patch 2 yield the cpu after a few updates.
>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time,
>>>> 64 updates is realistic too. A thousand is not.
>> Will do that.
>>
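For reference, here is a minimal sketch of the kind of dummy kworker
mentioned above: a self-rearming delayed work whose handler does no real
work. The names (rcu_kick_*) and the 5ms period are made up for
illustration and are not from any posted patch; the point is only that
running a kworker forces a context switch on the CPU, which gives RCU a
quiescent state to report instead of waiting out the busy task.

#include <linux/jiffies.h>
#include <linux/workqueue.h>

static void rcu_kick_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(rcu_kick_work, rcu_kick_fn);

/*
 * Does nothing useful by itself; merely being run in a kworker makes
 * the CPU pass through the scheduler, letting RCU note a quiescent
 * state there.
 */
static void rcu_kick_fn(struct work_struct *work)
{
	/* re-arm so a kworker keeps getting queued every ~5ms */
	schedule_delayed_work(&rcu_kick_work, msecs_to_jiffies(5));
}

/* call once, e.g. from the benchmark setup path */
static void rcu_kick_start(void)
{
	schedule_delayed_work(&rcu_kick_work, msecs_to_jiffies(5));
}

If the busy loop is pinned to specific CPUs, a per-CPU variant using
schedule_delayed_work_on() for each CPU would probably be needed, since
the sketch above only kicks the CPU on which the work happens to run.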