On Fri, Mar 14, 2025 at 9:27 AM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote:
>
> On Thu, Mar 13, 2025 at 12:23:55PM +0100, Mateusz Guzik wrote:
> > On Thu, Mar 13, 2025 at 9:59 AM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote:
> > > > There is other expensive work which can also be modified this way.
> > >
> > > Not sure what you're referring to?
> > >
> >
> > There are pgd_alloc/free calls which end up taking the global pgd_list
> > lock. Instead, this could be patched to let the mm hang out on the list
> > and get skipped by the pgd list iterating machinery as needed.
>
> Okay, IIUC, you want to remove it from the list in the destructor
> and not remove it from the list when freeing it, and the code iterating
> over the list needs to verify whether it is still allocated.
>
> Would this be with SLAB_TYPESAFE_BY_RCU, or by taking a lock within
> the destructor?
>
> The RFC patch [1] that removes the destructor states that "taking a spinlock
> in a destructor is a bit risky since the slab allocators may run the
> destructors anytime they decide a slab is no longer needed",
> but I need to think a bit more about how risky that actually is.
>
> [1] https://lore.kernel.org/linux-kernel/Pine.LNX.4.64.0705101156190.10663@xxxxxxxxxxxxxxxxxxxxxxxxx/#t
>

It's a spinlock which disables interrupts around itself, so it should
not be a problem.

> > > > there may be spurious mm_structs hanging out and eating pcpu resources.
> > > > Something can be added to reclaim those by the pcpu allocator.
> > >
> > > Not sure if I follow. What do you mean by spurious mm_struct, and how
> > > does the pcpu allocator reclaim that?
> > >
> >
> > Suppose a workload was run which created tons of mm_structs. The
> > workload is done and they can be reclaimed, but they hang out just in
> > case.
> >
> > Another workload showed up, but one which wants to do many percpu
> > allocs and is not mm_struct-heavy.
> >
> > In case of resource shortage it would be good if the percpu allocator
> > knew how to reclaim the known cached-but-not-used memory instead of
> > grabbing new pages.
> >
> > As for how to get there, it so happens the primary consumer (percpu
> > counters) already has a global list of all allocated objects. The
> > allocator could walk it and reclaim as needed.
>
> You mean reclaiming per-cpu objects along with the slab objects that use
> them? That sounds like a new slab shrinker for mm_struct?
>

At least the per-cpu thing; mm_struct itself optionally.

> > > > So that's it for making the case. As for the APIs, I think it would be
> > > > best if both dtor and ctor accepted a batch of objects to operate on,
> > > > but that's a lot of extra churn due to pre-existing ctor users.
> > >
> > > Why do you want to pass a batch of objects, instead of calling one by one
> > > for each object when a slab folio is allocated/freed?
> > >
> > > Is it solely to reduce the overhead of extra function calls when
> > > allocating or freeing a slab folio?
> > >
> >
> > The single-threaded overhead is one thing, contention is another:
> > back-to-back acquires are a degenerate case from a scalability
> > standpoint.
> >
> > While these codepaths should be executing rarely, there is still no
> > need to get spikes when they do.
> >
> > Even so, that's a minor thing which can be patched up later -- even a
> > ctor/dtor pair which operates on one object at a time is almost the
> > entirety of the improvement.
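
To make the per-object variant concrete, roughly something like the
below -- a sketch only, the dtor hook does not exist in the slab
allocator today and the names are made up:

#include <linux/mm_types.h>
#include <asm/pgalloc.h>

static void mm_cache_ctor(void *p)
{
	struct mm_struct *mm = p;

	/*
	 * Allocate the pgd up front; on configs which maintain pgd_list,
	 * pgd_alloc() also links it there, so a cached-but-unused mm stays
	 * on the list and the walkers have to know to skip it.
	 *
	 * Note the current ctor cannot report failure -- see below.
	 */
	mm->pgd = pgd_alloc(mm);
}

/* Hypothetical hook: the slab allocator has no destructors today. */
static void mm_cache_dtor(void *p)
{
	struct mm_struct *mm = p;

	/* One pgd_list lock acquire per object; batching would amortize it. */
	pgd_free(mm, mm->pgd);
}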
>
> Hmm, but I think even a ctor/dtor pair that operates on one object at
> a time requires changing the semantics of how the ctor works :'(
>
> For now the constructor is not expected to fail, but looking at
> mm_init()... it can fail. There is no way to let the slab allocator know
> that it failed.
>
> Looks like some churn is needed anyway?
>

Indeed, I did not realize it at the time, sorry :)

I'm happy to split the churn and add some 'return 0;' myself.

--
Mateusz Guzik <mjguzik gmail.com>
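
P.S. A sketch of the ctor change implied above, purely illustrative
(today's callback is void (*ctor)(void *); the names below are made up):

#include <linux/errno.h>
#include <linux/mm_types.h>
#include <asm/pgalloc.h>

/* Hypothetical: let ctors report failure so the allocator can back off. */
typedef int (*kmem_ctor_t)(void *obj);

/* Existing ctors which cannot fail only grow a 'return 0;'. */
static int some_existing_ctor(void *obj)
{
	/* ...unchanged initialization... */
	return 0;
}

/* An mm_struct ctor can then propagate -ENOMEM from pgd_alloc(). */
static int mm_cache_ctor(void *obj)
{
	struct mm_struct *mm = obj;

	mm->pgd = pgd_alloc(mm);
	return mm->pgd ? 0 : -ENOMEM;
}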