Re: [PATCH 1/3] rcu: Reduce synchronize_rcu() waiting time

Boqun Feng <boqun.feng@xxxxxxxxx> · Wed, 1 Nov 2023 18:04:40 -0700

On Wed, Nov 01, 2023 at 04:35:05PM +0100, Uladzislau Rezki wrote:
[...]
> > Basically it does not work, because you do not fix the mixing "issue".
> > I have been working on it and we agreed to separate it. Because it is
> > just makes sense. The reason and the problem i see, i described in the
> > commit message of v2.

As I understand it, your point is "if we want synchronize_rcu() faster
in all the cases, then a separate list makes a lot of sense since it
won't be affected by different configs and etc.", right? If so, then
understood.

I don't have any problem on that your patch does a good work on making
synchronize_rcu() faster, and actually I don't think my proposal
necessarily blocks your patch. However, I wonder: why synchronize_rcu()
is more special than other call_rcu_hurry() cases? Sure,
synchronize_rcu() means blocking and unblocking ealier is always
desirable, but call_rcu_hurry() could also queue a callback that wake
up other thread, right? By making synchronize_rcu() faster, do we end up
making other call_rcu_hurry() slow? So in my proposal, as you can see,
I tried to be fair among all call_rcu_hurry() users, and hope that's
a better solution from the whole kernel viewpoint.

Also another fear I have is the following story:

(Let say your improvement gets merged. And 6 months later)

Someone shows up and say "hi, the synchronize_rcu() latency reduce work
is great, but we have 1024 CPUs, so the single list in sr_normal_state
becomes a bottleneck, synchronize_rcu() on some CPUs gets delayed by
other CPU's synchronize_rcu(), can we make the list percpu?"

(And 6 months later)

Someone shows up and say "hi, the percpu list for low latency
synchronize_rcu() is great, however, we want to save some CPU power by
putting CPUs into groups and naming one leader of each group to handle
synchronize_rcu() wakeups for the whole group, let's use kconfig
CONFIG_RCU_NOSR_CPU to control that feature"

Well, it's a story, so it may not happen, but I don't think the
possiblity of totally re-inventing RCU callback lists and NOCB is 0 ;-)

Anyway, I should stop being annoying here, I will use your test steps to
check my idea, and will let you know!

> > 
> > >
> > > Do you have a benchmark I can try out to see if my diff can achieve the
> > > similar result? Thanks!
> > > 
> > There is no a good benchmark. But you can write it for sure. I tested
> > three scenarios:
> > 
> > - Run a camera app on our Android devices. Measuring app launch in
> >   milliseconds;
> > - Doing synchronize_rcu() and kfree(ptr) simultaneously by 10K/etc
> >   workers. It is important test case because we have a fallback to
> >   this scenario for our kvfree_rcu_mightslepp() API.
> > - I had a look at time delta of loading 100 kernel modules.
> > 
> > That were my main test cases.
> > 
> I will provide the patches and test steps, so you can try on.
> Tomorrow i will send it!
> 

Thanks!

Regards,
Boqun

> --
> Uladzislau Rezki