On Mon, May 01, 2023 at 11:08:05AM -0700, Suren Baghdasaryan wrote:
> On Mon, May 1, 2023 at 10:47 AM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
> >
> > On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote:
> > > Performance overhead:
> > > To evaluate performance we implemented an in-kernel test executing
> > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> > > sizes growing from 8 to 240 bytes, with CPU frequency set to max and CPU
> > > affinity set to a specific CPU to minimize the noise. Below is a performance
> > > comparison between the baseline kernel, profiling when enabled, profiling
> > > when disabled (nomem_profiling=y) and (for comparison purposes) baseline
> > > with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
> > >
> > >                         kmalloc              pgalloc
> > > Baseline (6.3-rc7)      9.200s               31.050s
> > > profiling disabled      9.800s  (+6.52%)     32.600s (+4.99%)
> > > profiling enabled       12.500s (+35.87%)    39.010s (+25.60%)
> > > memcg_kmem enabled      41.400s (+350.00%)   70.600s (+127.38%)
> >
> > Hm, this makes me think we have a regression with memcg_kmem in one of
> > the recent releases. When I measured it a couple of years ago, the overhead
> > was definitely within 100%.
> >
> > Do you understand what makes your profiling drastically faster than kmem?
>
> I haven't profiled or looked into kmem overhead closely, but I can do
> that. I just wanted to see how the overhead compares with the existing
> accounting mechanisms.

It's a good idea, and I generally think that +25-35% for kmalloc/pgalloc
should be ok for production use, which is great! In reality, most
workloads are not that sensitive to the speed of memory allocation.

> For kmalloc, the overhead is low because after we create the vector of
> slab_ext objects (which is the same as what memcg_kmem does), memory
> profiling just increments a lazy counter (which in many cases would be
> a per-cpu counter).

So does kmem (this is why I'm somewhat surprised by the difference).

> memcg_kmem operates on the cgroup hierarchy, with additional overhead
> associated with that. I'm guessing that's the reason for the big
> difference between these mechanisms, but I didn't look into the details
> to understand memcg_kmem performance.

I suspect recent rt-related changes and also the wide usage of rcu
primitives in the kmem code. I'll try to look closer as well.

Thanks!
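
For reference, a rough sketch of the kind of in-kernel timing loop
described in the quoted text above. This is purely illustrative: the
module name, iteration counts and printed granularity are assumptions,
not the actual test code from the series.

	/*
	 * Illustrative-only benchmark module: times tight kmalloc/kfree and
	 * __get_free_page/free_page loops, similar in spirit to the test
	 * described above. Not the actual test from the series.
	 */
	#include <linux/module.h>
	#include <linux/slab.h>
	#include <linux/gfp.h>
	#include <linux/ktime.h>

	#define LOOPS	1000000UL

	static int __init allocbench_init(void)
	{
		unsigned long i, addr;
		ktime_t start;
		size_t size;
		void *obj;

		/* kmalloc/kfree with allocation sizes growing from 8 to 240 bytes */
		start = ktime_get();
		for (i = 0; i < LOOPS; i++) {
			for (size = 8; size <= 240; size += 8) {
				obj = kmalloc(size, GFP_KERNEL);
				kfree(obj);
			}
		}
		pr_info("kmalloc loop: %lld ms\n",
			ktime_ms_delta(ktime_get(), start));

		/* single-page alloc/free loop */
		start = ktime_get();
		for (i = 0; i < LOOPS; i++) {
			addr = __get_free_page(GFP_KERNEL);
			free_page(addr);
		}
		pr_info("pgalloc loop: %lld ms\n",
			ktime_ms_delta(ktime_get(), start));

		return 0;
	}

	static void __exit allocbench_exit(void)
	{
	}

	module_init(allocbench_init);
	module_exit(allocbench_exit);
	MODULE_LICENSE("GPL");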
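
And a conceptual sketch of why a lazy per-cpu counter keeps the
allocation hot path cheap: the accounting boils down to a
this_cpu_add() with no hierarchy walk. Again, the structure and
function names below are illustrative assumptions, not the actual
implementation from the series or from memcg.

	/*
	 * Conceptual sketch only: a per-allocation-site counter bumped with
	 * this_cpu_add(), no cgroup hierarchy traversal on the hot path.
	 */
	#include <linux/percpu.h>

	struct site_counters {
		u64	bytes;
		u64	calls;
	};

	static inline void site_account(struct site_counters __percpu *ctr,
					size_t bytes)
	{
		this_cpu_add(ctr->bytes, bytes);
		this_cpu_inc(ctr->calls);
	}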