Hi Mateusz,

Thank you for providing a detailed explanation of how a destructor in
the slab allocator can be useful. A few questions and comments inline
below.

On Wed, Mar 12, 2025 at 11:33:09AM +0100, Mateusz Guzik wrote:
> I'm looking for someone(tm) willing to implement a destructor for slub.
>
> Currently SLUB only supports a constructor, a callback to use when
> first creating an object, but there is no matching callback for
> getting rid of it.
>
> The pair would come in handy when a frequently allocated and freed
> object performs the same expensive work each time.

Actually, the destructor feature existed in the past but was removed by
commit c59def9f222d ("Slab allocators: Drop support for destructors"),
because there were not many users and its usefulness was uncertain.

> The specific usage I have in mind is mm_struct -- it gets allocated on
> both each fork and exec and suffers global serialization several
> times.
>
> The primary thing I'm looking to handle this way is cid and percpu
> counter allocation, both going down to the percpu allocator which
> only has a global lock. The problem is exacerbated as it happens
> back-to-back, so that's 4 acquires per lifetime cycle (alloc and
> free).

That could be beneficial :-)

> There is other expensive work which can also be modified this way.

Not sure what you're referring to here. Could you elaborate?

> I recognize something like this would pose a tradeoff in terms of
> memory usage, but I don't believe it's a big deal. If you have a
> mm_struct hanging out, you are going to need to have the percpu memory
> up for grabs to make any use of it anyway.

Yes. Some memory overhead is expected, but I don't think it'd be
excessive.

> Granted, there may be spurious mm_struct's hanging out and eating pcpu
> resources. Something can be added to reclaim those by the pcpu
> allocator.

Not sure if I follow. What do you mean by a spurious mm_struct, and how
would the pcpu allocator reclaim it?

> So that's it for making the case, as for the APIs, I think it would be
> best if both dtor and ctor accepted a batch of objects to operate on,
> but that's a lot of extra churn due to pre-existing ctor users.

Why do you want to pass a batch of objects, instead of calling the
callback once per object when a slab folio is allocated or freed? Is it
solely to reduce the overhead of extra function calls when allocating
or freeing a slab folio?

> ACHTUNG: I think this particular usage would still want some buy in
> from the mm folk and at least Dennis (the percpu allocator
> maintainer), but one has to start somewhere. There were 2 different
> patchsets posted to move rss counters away from the current pcpu
> scheme, but both had different tradeoffs and ultimately died off.
>
> Should someone(tm) commit to sorting this out, I'll handle the percpu
> thing. There are some other tweaks warranted here (e.g., depessimizing
> the rss counter validation loop at exit).
>
> So what do you think?

I'd love to take the project and work on it. It makes sense to revive
the destructor feature if it turns out to be beneficial. I'll handle
the slab part.

> In order to bench yourself, you can grab code from here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
>
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
>
> I check spinlock problems with:
> bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }'

Yay! I was looking for something like this to evaluate the performance.
Thank you for providing it!

--
Cheers,
Harry (formerly known as Hyeonggon)
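
P.S. For the record, here is a rough sketch of how I picture the
mm_struct side looking once a destructor exists. kmem_cache_create_dtor()
and mm_cache_init_sketch() are made-up names just for illustration (no
such API exists today), error handling and the gfp context of the
constructor are glossed over, and the mm_cid part assumes
CONFIG_SCHED_MM_CID:

/*
 * Sketch only, not a concrete proposal: cache the percpu allocations
 * across kmem_cache_alloc()/kmem_cache_free() cycles so the percpu
 * allocator's global lock is only taken when the slab object itself is
 * created or discarded, not on every fork()/exec().
 */
#include <linux/mm_types.h>
#include <linux/percpu.h>
#include <linux/percpu_counter.h>
#include <linux/slab.h>

static struct kmem_cache *mm_cachep;

/* Runs once when the object is set up in a fresh slab, not on every fork/exec. */
static void mm_ctor(void *obj)
{
	struct mm_struct *mm = obj;

	mm->pcpu_cid = alloc_percpu(struct mm_cid);
	/* Return value ignored here; a real ctor needs a failure strategy. */
	percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL, NR_MM_COUNTERS);
}

/* Matching teardown, only when the object is finally given back. */
static void mm_dtor(void *obj)
{
	struct mm_struct *mm = obj;

	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
	free_percpu(mm->pcpu_cid);
}

void __init mm_cache_init_sketch(void)
{
	/* Hypothetical variant of kmem_cache_create() taking a dtor as well. */
	mm_cachep = kmem_cache_create_dtor("mm_struct", sizeof(struct mm_struct),
					   0, SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
					   mm_ctor, mm_dtor);
}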