On Fri, Mar 14, 2025 at 9:27 AM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote:
>
> On Thu, Mar 13, 2025 at 12:23:55PM +0100, Mateusz Guzik wrote:
> > On Thu, Mar 13, 2025 at 9:59 AM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote:
> > > > There is other expensive work which can also be modified this way.
> > >
> > > Not sure what you're referring to?
> > >
> >
> > There are pgd_alloc/free calls which end up taking the global pgd_list
> > lock. Instead, this could be patched to let the mm hang out on the list
> > and get skipped by the pgd list iterating machinery as needed.
>
> Okay, IIUC, you want to remove it from the list in the destructor
> and not remove it from the list when freeing it, and the code iterating
> over the list needs to verify whether it is still allocated.
>
> Would this be with SLAB_TYPESAFE_BY_RCU, or by taking a lock within
> the destructor?
>
> The RFC patch [1] that removes the destructor states that "taking a spinlock
> in a destructor is a bit risky since the slab allocators may run the
> destructors anytime they decide a slab is no longer needed",
> but I need to think a bit more about how risky that actually is.
>
> [1] https://lore.kernel.org/linux-kernel/Pine.LNX.4.64.0705101156190.10663@xxxxxxxxxxxxxxxxxxxxxxxxx/#t
>

It's a spinlock which disables interrupts around itself, so it should
not be a problem.

> > > > there may be spurious mm_structs hanging out and eating pcpu resources.
> > > > Something can be added to reclaim those by the pcpu allocator.
> > >
> > > Not sure if I follow. What do you mean by spurious mm_struct, and how
> > > does the pcpu allocator reclaim that?
> > >
> >
> > Suppose a workload was run which created tons of mm_structs. The
> > workload is done and they can be reclaimed, but they hang out just in
> > case.
> >
> > Another workload showed up, but one which wants to do many percpu
> > allocs and is not mm_struct-heavy.
> >
> > In case of resource shortage it would be good if the percpu allocator
> > knew how to reclaim the known cached-but-not-used memory instead of
> > grabbing new pages.
> >
> > As for how to get there, it so happens the primary consumer (percpu
> > counters) already has a global list of all allocated objects. The
> > allocator could walk it and reclaim as needed.
>
> You mean reclaiming per-cpu objects along with the slab objects that use
> them? That sounds like a new slab shrinker for mm_struct?
>

At least the per-cpu thing; mm_struct itself optionally.

> > > > So that's it for making the case. As for the APIs, I think it would be
> > > > best if both dtor and ctor accepted a batch of objects to operate on,
> > > > but that's a lot of extra churn due to pre-existing ctor users.
> > >
> > > Why do you want to pass a batch of objects, instead of calling one by one
> > > for each object when a slab folio is allocated/freed?
> > >
> > > Is it solely to reduce the overhead of extra function calls when
> > > allocating or freeing a slab folio?
> > >
> >
> > The single-threaded overhead is one thing, contention is another:
> > back-to-back acquires are a degenerate case from a scalability
> > standpoint.
> >
> > While these codepaths should be executing rarely, there is still no
> > need to get spikes when they do.
> >
> > Even so, that's a minor thing which can be patched up later -- even a
> > ctor/dtor pair which operates on one object at a time is almost the
> > entirety of the improvement.
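
To make the per-object variant concrete, roughly something like the
below -- a sketch only, the dtor hook does not exist in the slab
allocator today and the names are made up:

#include <linux/mm_types.h>
#include <asm/pgalloc.h>

static void mm_cache_ctor(void *p)
{
	struct mm_struct *mm = p;

	/*
	 * Allocate the pgd up front; on configs which maintain pgd_list,
	 * pgd_alloc() also links it there, so a cached-but-unused mm stays
	 * on the list and the walkers have to know to skip it.
	 *
	 * Note the current ctor cannot report failure -- see below.
	 */
	mm->pgd = pgd_alloc(mm);
}

/* Hypothetical hook: the slab allocator has no destructors today. */
static void mm_cache_dtor(void *p)
{
	struct mm_struct *mm = p;

	/* One pgd_list lock acquire per object; batching would amortize it. */
	pgd_free(mm, mm->pgd);
}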
>
> Hmm, but I think even a ctor/dtor pair that operates on one object at
> a time requires changing the semantics of how the ctor works :'(
>
> For now the constructor is not expected to fail, but looking at
> mm_init()... it can fail. There is no way to let the slab allocator know
> that it failed.
>
> Looks like some churn is needed anyway?
>

Indeed, I did not realize it at the time, sorry :)

I'm happy to split the churn and add some 'return 0;' myself.

--
Mateusz Guzik <mjguzik gmail.com>
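
P.S. A sketch of the ctor change implied above, purely illustrative
(today's callback is void (*ctor)(void *); the names below are made up):

#include <linux/errno.h>
#include <linux/mm_types.h>
#include <asm/pgalloc.h>

/* Hypothetical: let ctors report failure so the allocator can back off. */
typedef int (*kmem_ctor_t)(void *obj);

/* Existing ctors which cannot fail only grow a 'return 0;'. */
static int some_existing_ctor(void *obj)
{
	/* ...unchanged initialization... */
	return 0;
}

/* An mm_struct ctor can then propagate -ENOMEM from pgd_alloc(). */
static int mm_cache_ctor(void *obj)
{
	struct mm_struct *mm = obj;

	mm->pgd = pgd_alloc(mm);
	return mm->pgd ? 0 : -ENOMEM;
}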