On Mon, Jun 12, 2023 at 10:03:15AM +0800, Hou Tao wrote:
> Hi Paul,
>
> On 6/9/2023 10:30 PM, Paul E. McKenney wrote:
> > On Fri, Jun 09, 2023 at 11:02:59AM +0800, Hou Tao wrote:
> >> Hi Paul,
> >>
> >> On 6/9/2023 12:18 AM, Paul E. McKenney wrote:
> >>> On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
> >>>> Hi Paul,
> >>>>
> >>>> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
> >>>>> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> >>>>>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> >>>>>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> >>>>>>>>> As said in the commit message, the command line for the test is "./map_perf_test 4 8 16384", because the default max_entries is 1000. If the default max_entries is used and the number of CPUs is greater than 15, use_percpu_counter will be false.
> >>>>>>>> Right. percpu or not depends on number of cpus.
> >>>> SNIP
> >>>>>>>> known, because I had just proposed it in the email yesterday.
> >>>>>>> Also noticed that the overhead of the shared reuse_ready list comes both from the contended lock and from cache misses when one cpu pushes to the list after an RCU GP and another cpu removes from it.
> >>>>>>>
> >>>>>>> Also low/batch/high watermark are all wrong in patch 3. low=32 and high=96 make no sense when it's not a single list. I'm experimenting with 32 for all three heuristics.
> >>>>>>>
> >>>>>>> Another thing I noticed is that per-cpu prepare_reuse and free_by_rcu are redundant. unit_free() can push into free_by_rcu directly, then reuse_bulk() can fill it up with free_llist_extra and move them into waiting_for_gp.
> >>>>>> Yes. Indeed missing that.
> >>>>>>> All these _tail optimizations are obscuring the code and make it hard to notice these issues.
> >>>>>>>
> >>>>>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
> >>>>>>>>
> >>>>>>>> Answering your other email:
> >>>>>>>>
> >>>>>>>>> I see your point. I will continue to debug the memory usage difference between v3 and v4.
> >>>>>>>> imo it's a waste of time to continue analyzing performance based on the bench in patch 2.
> >>>>>> Don't agree with that. I still think the potential memory usage of v4 is a problem, and the difference in memory usage between v3 and v4 reveals that there is some peculiarity in the RCU subsystem, because the difference comes from the duration of the RCU grace period. We need to find out why and fix or work around that.
> >>>>> A tight loop in the kernel can extend RCU grace periods, especially for kernels built with CONFIG_PREEMPTION=n. Placing things like cond_resched() in such loops can help. Of course, if you are in an RCU read-side critical section (or holding a spinlock), you will need to exit, cond_resched(), then re-enter, taking care to ensure that the state upon re-entry is valid. After all, having exited either type of critical section, anything might happen.
> >>>> As said in the help-wanted email just sent out, it was weird that the RCU grace period was extended a lot, up to ~150ms or more, but queuing a dummy kworker periodically which does nothing helps to reduce the grace period to ~10ms. I have explained the context of the problem in that email, and hope that we could get some help from you and the RCU experts in the community.
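
For illustration only, a minimal sketch of the periodic empty-kworker workaround described above; the names, the 10ms interval, and the use of a self-rearming delayed work item are assumptions for the example, not the actual patch:

    /* The work handler does no real work; the context switch into the
     * worker kthread is what gives RCU the quiescent state it is
     * waiting for on an otherwise busy CPU.
     */
    #include <linux/workqueue.h>
    #include <linux/jiffies.h>

    static void dummy_work_fn(struct work_struct *work);
    static DECLARE_DELAYED_WORK(dummy_work, dummy_work_fn);

    static void dummy_work_fn(struct work_struct *work)
    {
            /* Intentionally empty; just re-arm for the next period. */
            schedule_delayed_work(&dummy_work, msecs_to_jiffies(10));
    }

    /* Kick it off once, e.g. from module init:
     *         schedule_delayed_work(&dummy_work, msecs_to_jiffies(10));
     */
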
> >>> OK, I will bite... Why do you think this is weird?
> >>>
> >>> RCU depends on context switches, among other things. If you have a long-running in-kernel task, that will naturally extend the grace period. Scheduling the empty worker provided the context switch that RCU needed to make forward progress.
> >> Because the empty kworker trick also works for CONFIG_PREEMPT_VOLUNTARY=y, and there is neither implicit nor explicit calling of schedule() in the kernel space of the producer thread when CONFIG_PREEMPT_VOLUNTARY=y. Also, I don't know how to define a long-running in-kernel task, because I have checked the latency of the getpgid() syscall in the producer thread when CONFIG_PREEMPT_VOLUNTARY=y. As shown in the histogram below, there are indeed some outliers, but most latencies are less than 1ms, so the producer thread will do a context switch within at most 1ms. But before the empty kworker trick, the latency of the RCU callback was about 100ms or more, and after the empty kworker trick, the latency of the RCU callback is reduced to ~8ms.
> >>
> >> @hist_us:
> >> [128, 256)        60 |                                                    |
> >> [256, 512)    101364 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >> [512, 1K)      17580 |@@@@@@@@@                                           |
> >> [1K, 2K)          16 |                                                    |
> >>
> >> @stat_us: count 119020, average 436, total 51957277
> > Ah!
> >
> > The trick is that if you have a non-nohz_full CPU spending most of its time in kernel space (in your case, due to repeated system calls), with high probability this CPU will appear to RCU as if it were spending all of its time in kernel space. This probability will increase with the fraction of the time this CPU spends in kernel space.
> >
> > If I am reading your histogram correctly, the overhead of your system call is usually between 256 and 512 microseconds. If you are invoking that system call in a tight loop, that CPU could easily be spending way more than 99% of its time in the kernel. Unless you have similar statistics for the average length of your CPU's times in userspace, let's assume exactly 99%.
> >
> > On a non-nohz_full CPU, RCU's only hint that a CPU is executing in userspace occurs when the once-per-jiffy scheduling-clock interrupt is taken from userspace. If the CPU is spending 99% of its time in the kernel, we can expect only one in 100 interrupts to be taken from userspace. On a HZ=1000 system, that will give you 100 milliseconds right there.
> Thanks for the detailed explanation. It seemed my previous understanding of RCU was wrong. My original thought was that when one syscall returns from kernel space, the process which does the syscall will be considered as being in a quiescent state and an RCU grace period will expire soon.

On a nohz_full CPU, that does in fact happen, courtesy of the context-tracking code's ct_user_enter() and ct_user_exit() functions. So your understanding was not entirely wrong. But context tracking adds overhead, so it is enabled only on kernels that require it, typically those running certain types of HPC and/or real-time workloads. On other kernels, there is no context tracking for user/kernel transitions, which (as discussed above) forces RCU to rely on the scheduling-clock interrupt, with the resulting grace-period delays.

							Thanx, Paul

> > But it gets worse. The variance is quite large. For example, for a given grace period, there is about a 5% chance that the delay will be 300 milliseconds instead of the expected 100 milliseconds. (You can easily check this by taking 0.99 to the 300th power. Or whatever other power you desire.)
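
Spelling out the arithmetic suggested above, treating each scheduling-clock interrupt as an independent 99% chance of landing in the kernel, as in the back-of-the-envelope estimate:

    P(no tick taken from userspace in N ticks) ~= 0.99^N
    0.99^100 ~= 0.37   (~100 ms at HZ=1000)
    0.99^200 ~= 0.13   (~200 ms)
    0.99^300 ~= 0.05   (~300 ms, the roughly 5% case mentioned above)
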
> > So the problem is that simple statistics is biting pretty hard here.
> >
> > As you may have noticed, life is like that sometimes. ;-)
> >
> > 							Thanx, Paul
> >
> >>> So 150 milliseconds is an OK RCU grace period. A bit long, but not excessively so. In contrast, by default in mainline, RCU starts seriously complaining if its grace period is extended beyond 21 *seconds*. This is when the RCU CPU stall warning will appear. (Yes, some Android configs are tuning this down to 20 milliseconds, but that is a special hardware and software configuration.)
> >>>
> >>> But if you want shorter RCU grace periods, what can you do?
> >>>
> >>> 1. Follow Alexei's good advice on one of your early patches.
> >>>
> >>> 2. As an alternative to scheduling an empty kworker, invoke something like rcu_momentary_dyntick_idle() periodically. Note well that it is illegal to invoke this in an RCU read-side critical section, with preemption disabled, from idle, ...
> >>>
> >>> 3. In non-preemptible kernels, cond_resched() is a much lighter weight alternative to rcu_momentary_dyntick_idle(). (Preemptible kernels get the same effect by preempting. In your case, this is also true, but it takes 150 milliseconds.)
> >>>
> >>> That should do for a start. ;-)
> >>>
> >>> 							Thanx, Paul
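
For illustration only, a sketch of what options 2 and 3 above might look like inside a long-running kernel loop; the surrounding function, the batch size of 64, and process_one() are made up for the example, and it assumes rcu_momentary_dyntick_idle() is visible in the context where the loop runs:

    /* Hypothetical long-running kernel loop that periodically gives RCU
     * a chance to observe a quiescent state on this CPU.
     * Needs <linux/sched.h> and <linux/rcupdate.h>.
     */
    static void process_many(unsigned long total)
    {
            unsigned long i;

            for (i = 0; i < total; i++) {
                    process_one();  /* placeholder for the real work */

                    if ((i & 63) != 63)
                            continue;

                    if (IS_ENABLED(CONFIG_PREEMPTION))
                            /* Per note 2 above: not from an RCU reader,
                             * not with preemption disabled, not from idle.
                             */
                            rcu_momentary_dyntick_idle();
                    else
                            cond_resched(); /* lighter weight, note 3 */
            }
    }
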
> >>>> Regards,
> >>>> Tao
> >>>>> 							Thanx, Paul
> >>>>>
> >>>>>>>>> I don't think so. Let's consider the per-cpu list first. Assume the normal RCU grace period is about 30ms and we are tracing the IO latency of a normal SSD. The iops is about 176K per second, so before one RCU GP has passed, we will need to allocate about 176 * 30 = 5.2K elements. For the per-ma list, when the number of CPUs increases, it is easy to make the list contain thousands of elements.
> >>>>>>>> That would be true only if there were no scheduling events in all of 176K ops. Which is not the case. I'm not sure why you're saying that RCU GP is 30ms.
> >>>>>> Because these freed elements will be freed after one RCU GP, and in v4 there is only one RCU callback per-cpu, so before one RCU GP has expired, these freed elements will be accumulated on the list.
> >>>>>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops. Every sched event is sort-of an implicit rcu_read_lock/unlock. Network and block IO don't process 176K packets without a resched. Don't know how block does it, but in networking NAPI will process 64 packets and then yield the softirq.
> >>>>>>>>
> >>>>>>>> For small size buckets low_watermark=32 and high=96. We typically move 32 elements at a time from one list to another. A bunch of elements may be sitting in free_by_rcu, and moving them to waiting_for_gp is not instant, but once __free_rcu_tasks_trace is called we need to take elements from waiting_for_gp one at a time and kfree them one at a time. So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
> >>>>>>>>
> >>>>>>>>> Before I post v5, I want to know the reason why the per-bpf-ma list is introduced. Previously, I thought it was used to handle the case in which allocation and freeing are done on different CPUs.
> >>>>>>>> Correct. The per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
> >>>>>>>>
> >>>>>>>>> And as we can see from the benchmark result above and in v3, the performance and the memory usage of v4 for add_del_on_diff_cpu are better than v3.
> >>>>>>>> The bench from patch 2 is invalid. Hence no conclusion can be made.
> >>>>>>>>
> >>>>>>>> So far the only bench we can trust and analyze is map_perf_test. Please make the bench in patch 2 yield the cpu after a few updates. Earlier I suggested sticking to 10, but since NAPI can do 64 at a time, 64 updates is realistic too. A thousand is not.
> >>>>>> Will do that.
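
As a rough illustration of the kind of change being asked for in the bench (this is not the actual patch-2 benchmark; the map fd, key/value types, and batch size are assumptions), a userspace producer loop that yields the CPU after every batch of 64 updates instead of running thousands back to back:

    #include <sched.h>
    #include <bpf/bpf.h>

    /* Hypothetical producer: hammer a hash map, but yield every 64 ops
     * so the kernel side is not one long uninterrupted stretch.
     */
    static void produce(int map_fd, unsigned long iterations)
    {
            unsigned long i;
            __u64 key, value = 42;

            for (i = 0; i < iterations; i++) {
                    key = i;
                    bpf_map_update_elem(map_fd, &key, &value, BPF_ANY);
                    bpf_map_delete_elem(map_fd, &key);

                    if ((i & 63) == 63)
                            sched_yield();  /* mirror NAPI's budget of 64 */
            }
    }
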