Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Paul,

On 6/9/2023 10:30 PM, Paul E. McKenney wrote:
> On Fri, Jun 09, 2023 at 11:02:59AM +0800, Hou Tao wrote:
>> Hi Paul,
>>
>> On 6/9/2023 12:18 AM, Paul E. McKenney wrote:
>>> On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
>>>> Hi Paul,
>>>>
>>>> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
>>>>> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
>>>>>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>>>>>>> <alexei.starovoitov@xxxxxxxxx> wrote:
>>>>>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>>>>>>> As said in the commit message, the command line for test is
>>>>>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>>>>>>> using default max_entries and the number of CPUs is greater than 15,
>>>>>>>>> use_percpu_counter will be false.
>>>>>>>> Right. percpu or not depends on number of cpus.
>>>> SNIP
>>>>>>>>  known, because I had just proposed it in the email yesterday.
>>>>>>> Also noticed that the overhead of shared reuse_ready list
>>>>>>> comes both from the contended lock and from cache misses
>>>>>>> when one cpu pushes to the list after RCU GP and another
>>>>>>> cpu removes.
>>>>>>>
>>>>>>> Also low/batch/high watermark are all wrong in patch 3.
>>>>>>> low=32 and high=96 makes no sense when it's not a single list.
>>>>>>> I'm experimenting with 32 for all three heuristics.
>>>>>>>
>>>>>>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
>>>>>>> are redundant.
>>>>>>> unit_free() can push into free_by_rcu directly
>>>>>>> then reuse_bulk() can fill it up with free_llist_extra and
>>>>>>> move them into waiting_for_gp.
>>>>>> Yes. Indeed missing that.
>>>>>>> All these _tail optimizations are obscuring the code and make it hard
>>>>>>> to notice these issues.
>>>>>>>
>>>>>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>>>>>>>>
>>>>>>>> Answering your other email:
>>>>>>>>
>>>>>>>>> I see your point. I will continue to debug the memory usage difference
>>>>>>>>> between v3 and v4.
>>>>>>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
>>>>>> Don't agree with that. I still think the potential memory usage of v4 is
>>>>>> a problem and the difference memory usage between v3 and v4 reveals that
>>>>>> there is some peculiarity in RCU subsystem, because the difference comes
>>>>>> from the duration of RCU grace period. We need to find out why and fix
>>>>>> or workaround that.
>>>>> A tight loop in the kernel can extend RCU grace periods, especially
>>>>> for kernels built with CONFIG_PREEPTION=n.  Placing things like
>>>>> cond_resched() in such loops can help.  Of course, if you are in an
>>>>> RCU read-side critical section (or holding a spinlock), you will need
>>>>> to exit, cond_resched(), then re-enter.  Taking care to ensure that the
>>>>> state upon re-entry is valid.  After all, having exited either type of
>>>>> critical section, anything might happen.
>>>> As said in the help-wanted email just send out, it was weird that the
>>>> RCU grace period was extended a lot, up to ~150ms or more. But queue a
>>>> dummy kworker periodically which does nothing will help to reduce the
>>>> grace period to ~10ms. I have explained the context of the problem in
>>>> that email. And hope that we could get some help from you and the RCU
>>>> experts in the community.
>>> OK, I will bite...  Why do you think this is weird?
>>>
>>> RCU depends on context switches, among other things.  If you have a
>>> long-running in-kernel task, that will naturally extend the grace period.
>>> Scheduling the empty worker provided the context switch that RCU needed
>>> to make forward progress.
>> Because the empty kwork trick also works for CONFIG_PREEMPT_VOLUNTARY=y.
>> And there is neither implicit or explicit calling of schedule() in the
>> kernel space of producer thread when CONFIG_PREEMPT_VOLUNTARY=y.
>> And also I don't know how to define a long-running in-kernel task,
>> because I have checked the latency of getpgid() syscall in producer
>> thread when CONFIG_PREEMPT_VOLUNTARY=y . As shown in the text below,
>> there are indeed some outliers, but the most latency is less than 1ms,
>> so the producer thread will do contex_switch in at most 1ms. But before
>> the empty kwork trick, the latency of RCU callback is about 100ms or
>> more, and after the empty kwork trick, the latenct for RCU callback is
>> reduced to ~8ms.
>>
>> @hist_us:
>> [128, 256)            60
>> |                                                    |
>> [256, 512)        101364
>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [512, 1K)          17580
>> |@@@@@@@@@                                           |
>> [1K, 2K)              16
>> |                                                    |
>>
>> @stat_us: count 119020, average 436, total 51957277
> Ah!
>
> The trick is that if you have a non-nohz_full CPU spending most of its
> time in kernel space (in your case, due to repeated system calls), with
> high probability this CPU will appear to RCU as if it were spending all
> of its time in kernel space.  This probability will increase with the
> fraction of the time this CPU spends in kernel space.
>
> If I am reading your histogram correctly, the overhead of your system
> call is usually etween 256 and 512 microseconds.  If you are invoking
> that system call in a tight loop, that CPU might could easily be spending
> way more than 99% of its time in the kernel.  Unless you have similar
> statistics for the average lenght of your CPU's times in userspace,
> let's assume exactly 99%.
>
> On a non-nohz_full CPU, RCU's only hint that a CPU is executing in
> userspace occurs when the once-per-jiffy scheduling-clock interrupt
> is taken from userspace.  If the CPU is spending 99% of its time in
> the kernel, we can expect only one in 100 interrupts being taken from
> userspace.  On a HZ=1000 system, that will give you 100 milliseconds
> right there.
Thanks for the detailed explanation. It seemed my previous understanding
of RCU was wrong. My original though was that when one syscall returns
from the kernel space, the process which does the syscall will be
considered as being in quiescent state and an RCU grace period will be
expired soon.
>
> But it gets worse.  The variance is quite large.  For example, for a
> given grace period, there is about a 5% chance that the delay will be
> 300 milliseconds instead of the expected 100 milliseconds.  (You can
> easily check this by taking 0.99 to the 300th power.  Or whatever other
> power you desire.)
>
> So the problem is that simple statistics is biting pretty hard here.
>
> As you may have noticed, life is like that sometimes.  ;-)
>
> 							Thanx, Paul
>
>>> So 150 milliseconds is an OK RCU grace period.  A bit long, but not
>>> excessively so.  In contrast, by default in mainline, RCU starts seriously
>>> complaining if its grace period is extended beyond 21 *seconds*.  This is
>>> when the RCU CPU stall warning will appear.  (Yes, some Android configs
>>> are tuning this down to 20 milliseconds, but that is a special hardware
>>> and software configuration.)
>>>
>>> But if you want shorter RCU grace periods, what can you do?
>>>
>>> 1.	Follow Alexei's good advice on one of your early patches.
>>>
>>> 2.	As an alternative to scheduling an empty kworker, invoke something
>>> 	like rcu_momentary_dyntick_idle() periodically.  Note well that
>>> 	it is illegal to invoke this in an RCU read-side critical section,
>>> 	with preemption disabled, from idle, ...
>>>
>>> 3.	In non-preemptible kernels, cond_resched() is a much lighter
>>> 	weight alternative to rcu_momentary_dyntick_idle().  (Preemptible
>>> 	kernels get the same effect by preempting.  In your case, this
>>> 	is also true, but it takes 150 milliseconds.)
>>>
>>> That should do for a start.  ;-)
>>>
>>> 							Thanx, Paul
>>>
>>>> Regards,
>>>> Tao
>>>>> 							Thanx, Paul
>>>>>
>>>>>>>>> I don't think so. Let's considering the per-cpu list first. Assume the
>>>>>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
>>>>>>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
>>>>>>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
>>>>>>>>> For the per-ma list, when the number of CPUs increased, it is easy to
>>>>>>>>> make the list contain thousands of elements.
>>>>>>>> That would be true only if there were no scheduling events in all of 176K ops.
>>>>>>>> Which is not the case.
>>>>>>>> I'm not sure why you're saying that RCU GP is 30ms.
>>>>>> Because these freed elements will be freed after one RCU GP and in v4
>>>>>> there is only one RCU callback per-cpu, so before one RCU GP is expired,
>>>>>> these freed elements will be accumulated on the list.
>>>>>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
>>>>>>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
>>>>>>>> Network and block IO doesn't process 176K packets without resched.
>>>>>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>>>>>>>>
>>>>>>>> For small size buckets low_watermark=32 and high=96.
>>>>>>>> We typically move 32 elements at a time from one list to another.
>>>>>>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
>>>>>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
>>>>>>>> elements from waiting_for_gp one at a time and kfree it one at a time.
>>>>>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>>>>>>>>
>>>>>>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
>>>>>>>>> introduced. Previously, I though it was used to handle the case in which
>>>>>>>>> allocation and freeing are done on different CPUs.
>>>>>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>>>>>>>>
>>>>>>>>> And as we can see
>>>>>>>>> from the benchmark result above and in v3, the performance and the
>>>>>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
>>>>>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
>>>>>>>>
>>>>>>>> So far the only bench we can trust and analyze is map_perf_test.
>>>>>>>> Please make bench in patch 2 yield the cpu after few updates.
>>>>>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
>>>>>>>> 64 updates is realistic too. A thousand is not.
>>>>>> Will do that.
>>>>>>
>>> .





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux