On Wed, Nov 01, 2023 at 04:35:05PM +0100, Uladzislau Rezki wrote:
[...]
> > Basically it does not work, because you do not fix the mixing "issue".
> > I have been working on it and we agreed to separate it. Because it is
> > just makes sense. The reason and the problem i see, i described in the
> > commit message of v2.

As I understand it, your point is "if we want synchronize_rcu() to be
faster in all cases, then a separate list makes a lot of sense, since it
won't be affected by different configs etc.", right?

If so, then understood. I don't have any problem with that: your patch
does good work on making synchronize_rcu() faster, and actually I don't
think my proposal necessarily blocks your patch.

However, I wonder: why is synchronize_rcu() more special than other
call_rcu_hurry() cases? Sure, synchronize_rcu() means blocking, and
unblocking earlier is always desirable, but call_rcu_hurry() could also
queue a callback that wakes up another thread, right? By making
synchronize_rcu() faster, do we end up making other call_rcu_hurry()
users slower? So in my proposal, as you can see, I tried to be fair
among all call_rcu_hurry() users, and I hope that's a better solution
from the whole-kernel viewpoint.

Another fear I have is the following story:

(Let's say your improvement gets merged. And 6 months later)

Someone shows up and says "hi, the synchronize_rcu() latency reduction
work is great, but we have 1024 CPUs, so the single list in
sr_normal_state becomes a bottleneck: synchronize_rcu() on some CPUs
gets delayed by other CPUs' synchronize_rcu(). Can we make the list
percpu?"

(And 6 months later)

Someone shows up and says "hi, the percpu list for low-latency
synchronize_rcu() is great, however, we want to save some CPU power by
putting CPUs into groups and naming one leader of each group to handle
synchronize_rcu() wakeups for the whole group, let's use kconfig
CONFIG_RCU_NOSR_CPU to control that feature"

Well, it's a story, so it may not happen, but I don't think the
possibility of totally re-inventing RCU callback lists and NOCB is 0 ;-)

Anyway, I should stop being annoying here. I will use your test steps to
check my idea, and will let you know!

> >
> > >
> > > Do you have a benchmark I can try out to see if my diff can achieve the
> > > similar result? Thanks!
> > >
> > There is no a good benchmark. But you can write it for sure. I tested
> > three scenarios:
> >
> > - Run a camera app on our Android devices. Measuring app launch in
> >   milliseconds;
> > - Doing synchronize_rcu() and kfree(ptr) simultaneously by 10K/etc
> >   workers. It is important test case because we have a fallback to
> >   this scenario for our kvfree_rcu_mightslepp() API.
> > - I had a look at time delta of loading 100 kernel modules.
> >
> > That were my main test cases.
> >
> I will provide the patches and test steps, so you can try on.
> Tomorrow i will send it!
>

Thanks!

Regards,
Boqun

> --
> Uladzislau Rezki