On 2020-04-21 11:09:14 [-0700], Paul E. McKenney wrote: > > Yes but why do we do this raw_spinlock_t here? It is not yet needed on > > v5.6-RT as I *did* check. It also complicates the code for !RT but > > nobody responded to that part but… > > I did respond by pointing out that the essentially similar call_rcu() > function ends up being invoked pretty much everywhere, including early > boot before rcu_init() has been invoked. It is therefore only reasonable > to assume that there will be a need for kfree_rcu() to tolerate a similar > range of calling contexts. Early in the boot we have IRQs disabled but also one CPU and no scheduling. That means that not a single lock is contained. > > That said: the current memory allocation is the problem here. The > > remaining part is fine. The part under the lock is small enough so it > > should not cause the trouble if it invokes queue_work() which will > > "only" enqueue the timer. > > To your point, the small memory allocation will be going away. The > memory allocations will pull in 4K pages of pointers. Oki. > On the timer, are you thinking of the queue_work() calls or instead of > the queue_delayed_work() calls? As of now, on the "first" invocation of kfree_rcu() it invokes queue_delayed_work(). The work is not active, the timer is not pending so it always enqueues a new timer. > > Side question: Is there any real-life workloads that benefits from this? > > I'm asking because rcuperf allocates the kfree_rcu() the pointer right > > away. The chances are high that the pointer are fed from the same page. > > SLUB's build_detached_freelist() scans the page of RCU's pointers to > > ensure that they are in the same page and then slab_free() them in one > > go. There is lookahead = 3 so it finds three different pages it stops > > further scanning and does slab_free() with what it found so far. > > > > Which means if your kfree_rcu() collects random pointer from the system, > > they may belong to different pages (especially if they are part of > > different "types"). > > It gets significantly better performance as it currently is due to > the reduced cache-miss rate scanning pointers in a page as opposed to > pointer-chasing through a series of rcu_head pointers. So the performance boost is not due to kfree_bulk() but due to the pointers which are "in ordered". > Yes, it might be even better if kfree_rcu() further sorted the freed > objects per slab or maybe even per page within a slab, but one step at > a time! For one thing, it is not hard to imagine situations where this > further sorting actually slowed things down, especially if the system > was under any sort of memory pressure or if the kfree_rcu() calls were > scattered across so many slabs and pages that it essentially reverted > back to pointer chasing. Okay. > Make sense, or am I missing your point? No, you got it. > Thanx, Paul Sebastian