Re: [PATCH 1/3] rcu: Use static initializer for krc.lock

Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> · Wed, 22 Apr 2020 13:13:24 +0200

On 2020-04-21 11:09:14 [-0700], Paul E. McKenney wrote:
> > Yes but why do we do this raw_spinlock_t here? It is not yet needed on
> > v5.6-RT as I *did* check. It also complicates the code for !RT but
> > nobody responded to that part but…
> 
> I did respond by pointing out that the essentially similar call_rcu()
> function ends up being invoked pretty much everywhere, including early
> boot before rcu_init() has been invoked.  It is therefore only reasonable
> to assume that there will be a need for kfree_rcu() to tolerate a similar
> range of calling contexts.

Early in the boot we have IRQs disabled but also one CPU and no
scheduling. That means that not a single lock is contained.

> > That said: the current memory allocation is the problem here. The
> > remaining part is fine. The part under the lock is small enough so it
> > should not cause the trouble if it invokes queue_work() which will
> > "only" enqueue the timer. 
> 
> To your point, the small memory allocation will be going away.  The
> memory allocations will pull in 4K pages of pointers.

Oki.

> On the timer, are you thinking of the queue_work() calls or instead of
> the queue_delayed_work() calls?

As of now, on the "first" invocation of kfree_rcu() it invokes
queue_delayed_work(). The work is not active, the timer is not pending
so it always enqueues a new timer.

> > Side question: Is there any real-life workloads that benefits from this?
> > I'm asking because rcuperf allocates the kfree_rcu() the pointer right
> > away. The chances are high that the pointer are fed from the same page.
> > SLUB's build_detached_freelist() scans the page of RCU's pointers to
> > ensure that they are in the same page and then slab_free() them in one
> > go. There is lookahead = 3 so it finds three different pages it stops
> > further scanning and does slab_free() with what it found so far.
> > 
> > Which means if your kfree_rcu() collects random pointer from the system,
> > they may belong to different pages (especially if they are part of
> > different "types").
> 
> It gets significantly better performance as it currently is due to
> the reduced cache-miss rate scanning pointers in a page as opposed to
> pointer-chasing through a series of rcu_head pointers.

So the performance boost is not due to kfree_bulk() but due to the
pointers which are "in ordered".

> Yes, it might be even better if kfree_rcu() further sorted the freed
> objects per slab or maybe even per page within a slab, but one step at
> a time!  For one thing, it is not hard to imagine situations where this
> further sorting actually slowed things down, especially if the system
> was under any sort of memory pressure or if the kfree_rcu() calls were
> scattered across so many slabs and pages that it essentially reverted
> back to pointer chasing.

Okay.

> Make sense, or am I missing your point?

No, you got it.

> 							Thanx, Paul

Sebastian