On Thu, Mar 13, 2025 at 12:23 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:

geez, my spelling today is really off even by my own "standard"

> On Thu, Mar 13, 2025 at 9:59 AM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote:
> > > There is other expensive work which can also be modified this way.
> >
> > Not sure what you're referring to?
> >
>
> There are pgd_alloc/free calls which end up taking the global pgd_list
> lock. Instead, this could be patched to let mm hang out on the get
> skipped by the pgd list iterating machinery as needed.

Instead, this could be patched to let mm hang out on the list and have
iterations of said list skip it as needed.

> > > there may be spurious mm_structs hanging out and eating pcpu resources.
> > > Something can be added to reclaim those by the pcpu allocator.
> >
> > Not sure if I follow. What do you mean by spurious mm_struct, and how
> > does the pcpu allocator reclaim that?
> >
>
> Suppose a workload was run which created tons of mm_structs. The
> workload is done and they can be reclaimed, but hang out just in case.
>
> Another workload showed up, but one which wants to do many percpu
> allocs and is not mm_struct-heavy.
>
> In case of resource shortage it would be good if the percpu allocator
> knew how to reclaim the known cached-but-not-used memory instead of
> grabbing new patches.

new pages

> As for how to get there, it so happens the primary consumer (percpu
> counters) already has a global list of all allocated objects. The
> allocator could walk it and reclaim as needed.
>
> > > So that's it for making the case; as for the APIs, I think it would be
> > > best if both dtor and ctor accepted a batch of objects to operate on,
> > > but that's a lot of extra churn due to pre-existing ctor users.
> >
> > Why do you want to pass a batch of objects, instead of calling one by one
> > for each object when a slab folio is allocated/freed?
> >
> > Is it solely to reduce the overhead of extra function calls when
> > allocating or freeing a slab folio?
> >
>
> The single-threaded overhead is one thing, some contention is another.
> Back-to-back acquires are a degenerate case from a scalability
> standpoint.
>
> While these codepaths should be executing rarely, there is still no
> need to get spikes when they do.
>
> Even so, that's a minor thing which can be patched up later -- even a
> ctor/dtor pair which operates on one obj at a time is almost the
> entirety of the improvement.
>
> > > ACHTUNG: I think this particular usage would still want some buy-in
> > > from the mm folk and at least Dennis (the percpu allocator
> > > maintainer), but one has to start somewhere. There were 2 different
> > > patchsets posted to move rss counters away from the current pcpu
> > > scheme, but both had different tradeoffs and ultimately died off.
> > >
> > > Should someone(tm) commit to sorting this out, I'll handle the percpu
> > > thing. There are some other tweaks warranted here (e.g., depessimizing
> > > the rss counter validation loop at exit).
> > >
> > > So what do you think?
> >
> > I'd love to take the project and work on it; it makes sense to revive
> > the destructor feature if that turns out to be beneficial.
> >
> > I'll do the slab part.
>
> Cool, thanks.
>
> --
> Mateusz Guzik <mjguzik gmail.com>

--
Mateusz Guzik <mjguzik gmail.com>
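
A minimal sketch of the ctor/dtor pairing the thread is discussing, assuming
slab destructors were revived in roughly their old one-object-at-a-time form.
percpu_counter_init_many(), percpu_counter_destroy_many(), and mm_struct's
rss_stat[NR_MM_COUNTERS] array are existing kernel interfaces; how the pair
would be wired up to mm_cachep is assumption:

    /*
     * Sketch only: slab has no destructors today, so every fork()/exit()
     * pays a percpu alloc/free for the rss counters. With a revived dtor
     * the percpu state could live for as long as the backing slab folio.
     */
    static void mm_cache_ctor(void *object)
    {
            struct mm_struct *mm = object;

            /*
             * Runs when a slab folio is populated, not on every
             * kmem_cache_alloc(). Note that ctors cannot fail while
             * percpu_counter_init_many() can return -ENOMEM -- one of
             * the details a real series would have to sort out.
             */
            percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
                                     NR_MM_COUNTERS);
    }

    static void mm_cache_dtor(void *object)
    {
            struct mm_struct *mm = object;

            /* Runs when the folio is handed back to the page allocator. */
            percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
    }

In principle a reused object starts out clean, since check_mm() already
verifies the rss counters return to zero on exit -- the "validation loop"
mentioned above.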
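
And the batched shape, with an invented signature, to show why one lock
round-trip per slab folio beats back-to-back acquires. pgd_lock is the real
x86 global; pgd_list_del() is currently static to arch/x86/mm/pgtable.c and
its use here (dropping cached pgds off pgd_list at folio-free time) is
assumption:

    /* Invented: the dtor receives every object backed by one slab folio. */
    static void mm_cache_dtor_batch(void **objects, unsigned int nr)
    {
            unsigned int i;

            for (i = 0; i < nr; i++) {
                    struct mm_struct *mm = objects[i];

                    percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
            }

            /*
             * Lock-protected teardown pays one acquire for the whole
             * folio instead of one per object -- the contention argument
             * for the batch API.
             */
            spin_lock(&pgd_lock);
            for (i = 0; i < nr; i++) {
                    struct mm_struct *mm = objects[i];

                    pgd_list_del(mm->pgd);
            }
            spin_unlock(&pgd_lock);
    }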