On Wed, Apr 22, 2020 at 01:13:24PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-04-21 11:09:14 [-0700], Paul E. McKenney wrote:
> > > Yes but why do we do this raw_spinlock_t here? It is not yet needed on
> > > v5.6-RT as I *did* check. It also complicates the code for !RT but
> > > nobody responded to that part but…
> > 
> > I did respond by pointing out that the essentially similar call_rcu()
> > function ends up being invoked pretty much everywhere, including early
> > boot before rcu_init() has been invoked. It is therefore only
> > reasonable to assume that there will be a need for kfree_rcu() to
> > tolerate a similar range of calling contexts.
> 
> Early in the boot we have IRQs disabled but also one CPU and no
> scheduling. That means that not a single lock is contended.

You are saying that call_rcu() is never invoked while holding a raw
spinlock?

> > > That said: the current memory allocation is the problem here. The
> > > remaining part is fine. The part under the lock is small enough so
> > > it should not cause the trouble if it invokes queue_work() which
> > > will "only" enqueue the timer.
> > 
> > To your point, the small memory allocation will be going away. The
> > memory allocations will pull in 4K pages of pointers.
> 
> Oki.
> 
> > On the timer, are you thinking of the queue_work() calls or instead
> > of the queue_delayed_work() calls?
> 
> As of now, on the "first" invocation of kfree_rcu() it invokes
> queue_delayed_work(). The work is not active, the timer is not
> pending, so it always enqueues a new timer.

Fair enough, and thank you for checking! The other work-queuing
operations are also OK as they are, then?

> > > Side question: Are there any real-life workloads that benefit from
> > > this? I'm asking because rcuperf allocates the kfree_rcu() pointer
> > > right away. The chances are high that the pointers are fed from
> > > the same page.
> > > SLUB's build_detached_freelist() scans RCU's pointers to ensure
> > > that they are in the same page and then slab_free()s them in one
> > > go. There is a lookahead of 3, so once it finds three different
> > > pages it stops further scanning and does slab_free() with what it
> > > found so far.
> > >
> > > Which means if your kfree_rcu() collects random pointers from the
> > > system, they may belong to different pages (especially if they are
> > > part of different "types").
> > 
> > It gets significantly better performance as it currently is due to
> > the reduced cache-miss rate of scanning pointers in a page as
> > opposed to pointer-chasing through a series of rcu_head pointers.
> 
> So the performance boost is not due to kfree_bulk() but due to the
> pointers being "in order".

Almost. Rather, the performance boost due to the cache locality of the
pointers justifies the change. The performance boost due to
kfree_bulk() can sometimes help further, for example, when a large
number of objects from the same slab are freed.

Again, thank you for digging into this!

							Thanx, Paul

> > Yes, it might be even better if kfree_rcu() further sorted the freed
> > objects per slab or maybe even per page within a slab, but one step
> > at a time! For one thing, it is not hard to imagine situations where
> > this further sorting actually slowed things down, especially if the
> > system was under any sort of memory pressure or if the kfree_rcu()
> > calls were scattered across so many slabs and pages that it
> > essentially reverted back to pointer chasing.
> 
> Okay.
> 
> > Make sense, or am I missing your point?
> 
> No, you got it.
> 
> > Thanx, Paul
> 
> Sebastian
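P.S. For anyone following along: the pattern discussed above (a page of
pointers accumulated per call, with queue_delayed_work() armed only on
the transition from empty to non-empty) can be sketched in plain
userspace C. All names and sizes below are illustrative, not the actual
kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: pointers are batched into a page-sized array,
 * and the (simulated) delayed work is enqueued only when the batch is
 * empty -- i.e. only the "first" kfree_rcu() call arms the timer.
 */
#define PTRS_PER_BATCH 512		/* one 4K page of 8-byte pointers */

struct kfree_batch {
	void *ptrs[PTRS_PER_BATCH];
	size_t nr;
	bool work_pending;		/* stands in for queue_delayed_work() */
	unsigned int timers_armed;	/* how often a timer was enqueued */
};

/* Add one pointer; arm the "timer" only if no work is pending yet. */
void batch_add(struct kfree_batch *b, void *p)
{
	b->ptrs[b->nr++] = p;
	if (!b->work_pending) {
		b->work_pending = true;	/* queue_delayed_work() analogue */
		b->timers_armed++;
	}
}

/* The "work function": hand everything back in one go, clear the flag. */
size_t batch_flush(struct kfree_batch *b)
{
	size_t freed = b->nr;		/* kfree_bulk() would run here */
	b->nr = 0;
	b->work_pending = false;
	return freed;
}
```

So repeated kfree_rcu() calls between flushes only enqueue one timer,
matching the "first invocation" behaviour described above.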
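P.P.S. The lookahead behaviour Sebastian describes can be sketched the
same way: scan an array of pointers, keep those sharing the first
pointer's page, and stop once three different pages have been seen.
This is a loose userspace approximation of my reading of
build_detached_freelist(), not the SLUB code itself; page membership is
simulated by masking the address:

```c
#include <stddef.h>

#define PAGE_MASK_SIM (~(size_t)0xfff)	/* 4K pages, illustration only */
#define LOOKAHEAD 3

/* Collect into out[] the pointers sharing ptrs[0]'s page; give up once
 * LOOKAHEAD distinct other pages have been encountered. Returns the
 * number of pointers collected (they would be slab_free()d in one go).
 */
size_t scan_same_page(void **ptrs, size_t n, void **out)
{
	size_t first = (size_t)ptrs[0] & PAGE_MASK_SIM;
	size_t foreign[LOOKAHEAD];
	size_t nforeign = 0, taken = 0;

	for (size_t i = 0; i < n; i++) {
		size_t page = (size_t)ptrs[i] & PAGE_MASK_SIM;

		if (page == first) {
			out[taken++] = ptrs[i];
			continue;
		}
		/* Count distinct foreign pages; stop at LOOKAHEAD. */
		size_t j;
		for (j = 0; j < nforeign; j++)
			if (foreign[j] == page)
				break;
		if (j == nforeign) {
			foreign[nforeign++] = page;
			if (nforeign == LOOKAHEAD)
				break;	/* three different pages seen */
		}
	}
	return taken;
}
```

Which illustrates the point: randomly collected pointers that straddle
many pages cut the scan short, while pointers "in order" let the whole
batch go to slab_free() together.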