On Mon, May 01, 2023 at 03:37:58PM -0400, Kent Overstreet wrote:
> On Mon, May 01, 2023 at 11:14:45AM -0700, Roman Gushchin wrote:
> > It's a good idea and I generally think that +25-35% for kmalloc/pgalloc
> > should be ok for production use, which is great! In reality, most
> > workloads are not that sensitive to the speed of memory allocation.
> 
> :)
> 
> My main takeaway has been "the slub fast path is _really_ fast". No
> disabling of preemption, no atomic instructions, just a non-locked
> double-word cmpxchg - it's a slick piece of work.
> 
> > > For kmalloc, the overhead is low because after we create the vector
> > > of slab_ext objects (which is the same as what memcg_kmem does),
> > > memory profiling just increments a lazy counter (which in many cases
> > > would be a per-cpu counter).
> > 
> > So does kmem (this is why I'm somewhat surprised by the difference).
> > 
> > > memcg_kmem operates on the cgroup hierarchy, with additional
> > > overhead associated with that. I'm guessing that's the reason for
> > > the big difference between these mechanisms, but I didn't look into
> > > the details to understand memcg_kmem performance.
> > 
> > I suspect recent rt-related changes and also the wide usage of rcu
> > primitives in the kmem code. I'll try to look closer as well.
> 
> Happy to give you something to compare against :)

To be fair, it's not an apples-to-apples comparison, because:

1) memcgs are organized in a tree, these days usually with at least
   3 layers,

2) memcgs are dynamic. In theory a task can be moved to a different
   memcg while performing a (very slow) allocation, and the original
   memcg can be released. To prevent this we have to perform a lot of
   operations which you can happily avoid.

That said, there is clearly a place for optimization, so thank you
for indirectly bringing this up.

Thanks!
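
P.S. To make the "non-locked double-word cmpxchg" point above concrete,
here is a minimal userspace sketch of the pattern, assuming C11 atomics
and a 16-byte CAS (may need -mcx16/libatomic). The names (freelist_pop,
next_free) are made up for the sketch; the real fast path lives in
mm/slub.c and is more involved.

/*
 * Sketch of the SLUB fast-path idea: pop the head of a per-cpu
 * freelist with one unlocked double-word compare-and-swap on a
 * (freelist, tid) pair. Not the kernel code, just the shape of it.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

struct freelist {
	void *head;		/* first free object, or NULL when empty */
	uintptr_t tid;		/* "transaction id", bumped on every op */
};

/* Each free object stores the pointer to the next free object. */
static inline void *next_free(void *object)
{
	return *(void **)object;
}

static void *freelist_pop(_Atomic struct freelist *fl)
{
	struct freelist old, new;

	old = atomic_load_explicit(fl, memory_order_acquire);
	do {
		if (!old.head)
			return NULL;	/* empty: fall back to slow path */
		new.head = next_free(old.head);
		new.tid = old.tid + 1;
		/*
		 * One double-word CAS replaces taking a lock or disabling
		 * preemption: if we raced or were migrated, tid has
		 * changed, the CAS fails, and we simply retry.
		 */
	} while (!atomic_compare_exchange_weak_explicit(fl, &old, new,
			memory_order_acq_rel, memory_order_acquire));

	return old.head;
}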
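
The "lazy counter" in the quoted text can be sketched the same way:
the allocation fast path does a single relaxed add to a per-cpu slot,
and only a rare reader sums all slots. This is a userspace model,
with sched_getcpu() and a fixed slot array standing in for the real
percpu machinery; the names are illustrative.

#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>

#define NR_CPUS 64	/* fixed size, for the sketch only */

struct lazy_counter {
	/*
	 * In real percpu data each slot lives in its own CPU's area,
	 * so the common case has no cross-CPU cache-line traffic.
	 */
	_Atomic long slot[NR_CPUS];
};

static inline void lazy_add(struct lazy_counter *c, long bytes)
{
	int cpu = sched_getcpu();

	if (cpu < 0)
		cpu = 0;	/* sketch-only fallback */
	/* Relaxed is enough: we only need an eventually-consistent sum. */
	atomic_fetch_add_explicit(&c->slot[cpu % NR_CPUS], bytes,
				  memory_order_relaxed);
}

static long lazy_read(struct lazy_counter *c)
{
	long sum = 0;

	for (int i = 0; i < NR_CPUS; i++)
		sum += atomic_load_explicit(&c->slot[i],
					    memory_order_relaxed);
	return sum;	/* approximate snapshot; fine for profiling */
}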
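
And points 1) and 2) are roughly why the memcg charge path does more
work than a flat counter: it has to pin a group that may be dying
(compare css_tryget() in the kernel) and then propagate the charge up
the ancestor chain. A hedged userspace sketch of that extra work, with
all names (group_tryget, current_group, charge) made up; the real
logic is in mm/memcontrol.c and differs in the details:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct group {
	_Atomic long refcnt;	/* 0 means "being destroyed" */
	struct group *parent;	/* hierarchy, typically 3+ levels */
	_Atomic long charged;
};

/* Like css_tryget(): never resurrect a group whose refcount hit 0. */
static bool group_tryget(struct group *g)
{
	long old = atomic_load_explicit(&g->refcnt, memory_order_relaxed);

	while (old > 0) {
		if (atomic_compare_exchange_weak_explicit(&g->refcnt,
				&old, old + 1,
				memory_order_acquire, memory_order_relaxed))
			return true;
	}
	return false;
}

static void group_put(struct group *g)
{
	atomic_fetch_sub_explicit(&g->refcnt, 1, memory_order_release);
}

/*
 * Stand-in for "the memcg of the current task", which can change
 * under us if the task is migrated between cgroups mid-allocation.
 */
static struct group root_group = { .refcnt = 1 };

static struct group *current_group(void)
{
	return &root_group;
}

static void charge(long bytes)
{
	struct group *g;

	/* Retry until we pin a group that is still alive. */
	do {
		g = current_group();
	} while (!group_tryget(g));

	/* Charge every ancestor: the part a flat counter avoids. */
	for (struct group *p = g; p; p = p->parent)
		atomic_fetch_add_explicit(&p->charged, bytes,
					  memory_order_relaxed);

	group_put(g);
}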