Hi Paul,

On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>>> <alexei.starovoitov@xxxxxxxxx> wrote:
>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>>> As said in the commit message, the command line for the test is
>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>>> using the default max_entries and the number of CPUs is greater than 15,
>>>>> use_percpu_counter will be false.
>>>> Right. percpu or not depends on number of cpus.
SNIP
>>>> known, because I had just proposed it in the email yesterday.
>>> Also noticed that the overhead of the shared reuse_ready list
>>> comes both from the contended lock and from cache misses
>>> when one cpu pushes to the list after the RCU GP and another
>>> cpu removes.
>>>
>>> Also the low/batch/high watermarks are all wrong in patch 3.
>>> low=32 and high=96 make no sense when it's not a single list.
>>> I'm experimenting with 32 for all three heuristics.
>>>
>>> Another thing I noticed is that per-cpu prepare_reuse and free_by_rcu
>>> are redundant.
>>> unit_free() can push into free_by_rcu directly,
>>> then reuse_bulk() can fill it up with free_llist_extra and
>>> move them into waiting_for_gp.
>> Yes. Indeed, I missed that.
>>> All these _tail optimizations are obscuring the code and make it hard
>>> to notice these issues.
>>>
>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>>>>
>>>> Answering your other email:
>>>>
>>>>> I see your point. I will continue to debug the memory usage difference
>>>>> between v3 and v4.
>>>> imo it's a waste of time to continue analyzing performance based on the bench in patch 2.
>> Don't agree with that. I still think the potential memory usage of v4 is
>> a problem, and the different memory usage between v3 and v4 reveals that
>> there is some peculiarity in the RCU subsystem, because the difference comes
>> from the duration of the RCU grace period. We need to find out why and fix
>> or work around that.
> A tight loop in the kernel can extend RCU grace periods, especially
> for kernels built with CONFIG_PREEMPTION=n. Placing things like
> cond_resched() in such loops can help. Of course, if you are in an
> RCU read-side critical section (or holding a spinlock), you will need
> to exit, cond_resched(), then re-enter, taking care to ensure that the
> state upon re-entry is valid. After all, having exited either type of
> critical section, anything might happen.

As said in the help-wanted email I just sent out, it is weird that the
RCU grace period is extended a lot, up to ~150ms or more, but
periodically queueing a dummy kworker which does nothing helps reduce
the grace period to ~10ms. I have explained the context of the problem
in that email, and I hope we can get some help from you and the RCU
experts in the community. A rough sketch of what I mean by a dummy
kworker is appended below the quoted text.

Regards,
Tao

>
> 							Thanx, Paul
>
>>>>> I don't think so. Let's consider the per-cpu list first. Assume the
>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
>>>>> of a normal SSD. The iops is about 176K per second, so before one RCU
>>>>> GP has passed, we will need to allocate about 176 * 30 = 5.2K elements.
>>>>> For the per-ma list, when the number of CPUs increases, it is easy to
>>>>> make the list contain thousands of elements.
>>>> That would be true only if there were no scheduling events in all of 176K ops.
>>>> Which is not the case.
>>>> I'm not sure why you're saying that RCU GP is 30ms.
>> Because these freed elements will be freed after one RCU GP, and in v4
>> there is only one RCU callback per-cpu, so before one RCU GP has expired,
>> these freed elements will be accumulated on the list.
>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
>>>> Every sched event is sort-of an implicit rcu_read_lock/unlock.
>>>> Network and block IO don't process 176K packets without resched.
>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>>>>
>>>> For small size buckets low_watermark=32 and high=96.
>>>> We typically move 32 elements at a time from one list to another.
>>>> A bunch of elements may be sitting in free_by_rcu, and moving them to waiting_for_gp
>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
>>>> elements from waiting_for_gp one at a time and kfree them one at a time.
>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>>>>
>>>>> Before I post v5, I want to know the reason why the per-bpf-ma list is
>>>>> introduced. Previously, I thought it was used to handle the case in which
>>>>> allocation and freeing are done on different CPUs.
>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>>>>
>>>>> And as we can see
>>>>> from the benchmark result above and in v3, the performance and the
>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
>>>>
>>>> So far the only bench we can trust and analyze is map_perf_test.
>>>> Please make the bench in patch 2 yield the cpu after a few updates.
>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time,
>>>> 64 updates is realistic too. A thousand is not.
>> Will do that.
>>
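For reference, here is a minimal sketch of the kind of dummy kworker
mentioned above: a self-rearming delayed work whose handler does no real
work. The names (rcu_kick_*) and the 5ms period are made up for
illustration and are not from any posted patch; the point is only that
running a kworker forces a context switch on the CPU, which gives RCU a
quiescent state to report instead of waiting out the busy task.

#include <linux/jiffies.h>
#include <linux/workqueue.h>

static void rcu_kick_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(rcu_kick_work, rcu_kick_fn);

/*
 * Does nothing useful by itself; merely being run in a kworker makes
 * the CPU pass through the scheduler, letting RCU note a quiescent
 * state there.
 */
static void rcu_kick_fn(struct work_struct *work)
{
	/* re-arm so a kworker keeps getting queued every ~5ms */
	schedule_delayed_work(&rcu_kick_work, msecs_to_jiffies(5));
}

/* call once, e.g. from the benchmark setup path */
static void rcu_kick_start(void)
{
	schedule_delayed_work(&rcu_kick_work, msecs_to_jiffies(5));
}

If the busy loop is pinned to specific CPUs, a per-CPU variant using
schedule_delayed_work_on() for each CPU would probably be needed, since
the sketch above only kicks the CPU on which the work happens to run.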