On Wed, Apr 22, 2020 at 01:13:24PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-04-21 11:09:14 [-0700], Paul E. McKenney wrote:
> > > Yes but why do we do this raw_spinlock_t here? It is not yet needed on
> > > v5.6-RT as I *did* check. It also complicates the code for !RT but
> > > nobody responded to that part but…
> > 
> > I did respond by pointing out that the essentially similar call_rcu()
> > function ends up being invoked pretty much everywhere, including early
> > boot before rcu_init() has been invoked. It is therefore only
> > reasonable to assume that there will be a need for kfree_rcu() to
> > tolerate a similar range of calling contexts.
> 
> Early in the boot we have IRQs disabled but also one CPU and no
> scheduling. That means that not a single lock is contended.

You are saying that call_rcu() is never invoked while holding a raw
spinlock?

> > > That said: the current memory allocation is the problem here. The
> > > remaining part is fine. The part under the lock is small enough so
> > > it should not cause the trouble if it invokes queue_work() which
> > > will "only" enqueue the timer.
> > 
> > To your point, the small memory allocation will be going away. The
> > memory allocations will pull in 4K pages of pointers.
> 
> Oki.
> 
> > On the timer, are you thinking of the queue_work() calls or instead
> > of the queue_delayed_work() calls?
> 
> As of now, on the "first" invocation of kfree_rcu() it invokes
> queue_delayed_work(). The work is not active, the timer is not
> pending, so it always enqueues a new timer.

Fair enough, and thank you for checking! The other work-queuing
operations are also OK as they are, then?

> > > Side question: Are there any real-life workloads that benefit from
> > > this? I'm asking because rcuperf allocates the kfree_rcu() pointer
> > > right away. The chances are high that the pointers are fed from
> > > the same page.
> > > SLUB's build_detached_freelist() scans RCU's pointers to ensure
> > > that they are in the same page and then slab_free()s them in one
> > > go. There is a lookahead of 3, so once it finds three different
> > > pages it stops further scanning and does slab_free() with what it
> > > found so far.
> > >
> > > Which means if your kfree_rcu() collects random pointers from the
> > > system, they may belong to different pages (especially if they are
> > > part of different "types").
> > 
> > It gets significantly better performance as it currently is due to
> > the reduced cache-miss rate of scanning pointers in a page as
> > opposed to pointer-chasing through a series of rcu_head pointers.
> 
> So the performance boost is not due to kfree_bulk() but due to the
> pointers being "in order".

Almost. Rather, the performance boost due to the cache locality of the
pointers justifies the change. The performance boost due to
kfree_bulk() can sometimes help further, for example, when a large
number of objects from the same slab are freed.

Again, thank you for digging into this!

							Thanx, Paul

> > Yes, it might be even better if kfree_rcu() further sorted the freed
> > objects per slab or maybe even per page within a slab, but one step
> > at a time! For one thing, it is not hard to imagine situations where
> > this further sorting actually slowed things down, especially if the
> > system was under any sort of memory pressure or if the kfree_rcu()
> > calls were scattered across so many slabs and pages that it
> > essentially reverted back to pointer chasing.
> 
> Okay.
> 
> > Make sense, or am I missing your point?
> 
> No, you got it.
> 
> > Thanx, Paul
> 
> Sebastian
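P.S. For anyone following along: the pattern discussed above (a page of
pointers accumulated per call, with queue_delayed_work() armed only on
the transition from empty to non-empty) can be sketched in plain
userspace C. All names and sizes below are illustrative, not the actual
kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: pointers are batched into a page-sized array,
 * and the (simulated) delayed work is enqueued only when the batch is
 * empty -- i.e. only the "first" kfree_rcu() call arms the timer.
 */
#define PTRS_PER_BATCH 512		/* one 4K page of 8-byte pointers */

struct kfree_batch {
	void *ptrs[PTRS_PER_BATCH];
	size_t nr;
	bool work_pending;		/* stands in for queue_delayed_work() */
	unsigned int timers_armed;	/* how often a timer was enqueued */
};

/* Add one pointer; arm the "timer" only if no work is pending yet. */
void batch_add(struct kfree_batch *b, void *p)
{
	b->ptrs[b->nr++] = p;
	if (!b->work_pending) {
		b->work_pending = true;	/* queue_delayed_work() analogue */
		b->timers_armed++;
	}
}

/* The "work function": hand everything back in one go, clear the flag. */
size_t batch_flush(struct kfree_batch *b)
{
	size_t freed = b->nr;		/* kfree_bulk() would run here */
	b->nr = 0;
	b->work_pending = false;
	return freed;
}
```

So repeated kfree_rcu() calls between flushes only enqueue one timer,
matching the "first invocation" behaviour described above.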
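P.P.S. The lookahead behaviour Sebastian describes can be sketched the
same way: scan an array of pointers, keep those sharing the first
pointer's page, and stop once three different pages have been seen.
This is a loose userspace approximation of my reading of
build_detached_freelist(), not the SLUB code itself; page membership is
simulated by masking the address:

```c
#include <stddef.h>

#define PAGE_MASK_SIM (~(size_t)0xfff)	/* 4K pages, illustration only */
#define LOOKAHEAD 3

/* Collect into out[] the pointers sharing ptrs[0]'s page; give up once
 * LOOKAHEAD distinct other pages have been encountered. Returns the
 * number of pointers collected (they would be slab_free()d in one go).
 */
size_t scan_same_page(void **ptrs, size_t n, void **out)
{
	size_t first = (size_t)ptrs[0] & PAGE_MASK_SIM;
	size_t foreign[LOOKAHEAD];
	size_t nforeign = 0, taken = 0;

	for (size_t i = 0; i < n; i++) {
		size_t page = (size_t)ptrs[i] & PAGE_MASK_SIM;

		if (page == first) {
			out[taken++] = ptrs[i];
			continue;
		}
		/* Count distinct foreign pages; stop at LOOKAHEAD. */
		size_t j;
		for (j = 0; j < nforeign; j++)
			if (foreign[j] == page)
				break;
		if (j == nforeign) {
			foreign[nforeign++] = page;
			if (nforeign == LOOKAHEAD)
				break;	/* three different pages seen */
		}
	}
	return taken;
}
```

Which illustrates the point: randomly collected pointers that straddle
many pages cut the scan short, while pointers "in order" let the whole
batch go to slab_free() together.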