On Wed, Feb 14, 2024 at 10:54 AM Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
>
> On Mon, 2024-02-12 at 13:38 -0800, Suren Baghdasaryan wrote:
> > Memory allocation, v3 and final:
> >
> > Overview:
> > Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> > kernels, overhead low enough to be deployed in production.
> >
> > We're aiming to get this in the next merge window, for 6.9. The feedback
> > we've gotten has been that even out of tree this patchset has already
> > been useful, and there's a significant amount of other work gated on the
> > code tagging functionality included in this patchset [2].
> >
> > Example output:
> > root@moria-kvm:~# sort -h /proc/allocinfo|tail
> >  3.11MiB     2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode
> >  3.52MiB      225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node
> >  3.75MiB      960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
> >  4.00MiB        2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio
> >  10.5MiB      168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs
> >  14.0MiB     3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof
> >  26.8MiB     6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof
> >  64.5MiB    98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init
> >  98.7MiB    25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof
> >   125MiB     7357 mm/slub.c:2201 module:slub func:alloc_slab_page
> >
> > Since v2:
> >  - tglx noticed a circular header dependency between sched.h and percpu.h;
> >    a bunch of header cleanups were merged into 6.8 to ameliorate this [3].
> >
> >  - a number of improvements, moving alloc_hooks() annotations to the
> >    correct place for better tracking (mempool), and bugfixes.
> >
> >  - looked at alternate hooking methods.
> >    There were suggestions on alternate methods (compiler attribute,
> >    trampolines), but they wouldn't have made the patchset any cleaner
> >    (we still need to have different function versions for accounting vs. no
> >    accounting to control at which point in a call chain the accounting
> >    happens), and they would have added a dependency on toolchain
> >    support.
> >
> > Usage:
> > kconfig options:
> >  - CONFIG_MEM_ALLOC_PROFILING
> >  - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> >  - CONFIG_MEM_ALLOC_PROFILING_DEBUG
> >    adds warnings for allocations that weren't accounted because of a
> >    missing annotation
> >
> > sysctl:
> >   /proc/sys/vm/mem_profiling
> >
> > Runtime info:
> >   /proc/allocinfo
> >
> > Notes:
> >
> > [1]: Overhead
> > To measure the overhead we are comparing the following configurations:
> > (1) Baseline with CONFIG_MEMCG_KMEM=n
> > (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
> >     CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n)
> > (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
> >     CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y)
> > (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
> >     CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
> > (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
> >
>
> Thanks for the work on this patchset and it is quite useful.
> A clarification question on the data:
>
> I assume Config (2), (3) and (4) have CONFIG_MEMCG_KMEM=n, right?

Yes, correct.

> If so do you have similar data for config (2), (3) and (4) but with
> CONFIG_MEMCG_KMEM=y for comparison with (5)?

I have data for these additional configs (didn't think they were that
important):
(6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y
(7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y

>
> Tim
>
> > Performance overhead:
> > To evaluate performance we implemented an in-kernel test executing
> > multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> > affinity set to a specific CPU to minimize the noise. Below are results
> > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
> > 56 core Intel Xeon:
> >
> >                         kmalloc             pgalloc
> > (1 baseline)            6.764s              16.902s
> > (2 default disabled)    6.793s (+0.43%)     17.007s (+0.62%)
> > (3 default enabled)     7.197s (+6.40%)     23.666s (+40.02%)
> > (4 runtime enabled)     7.405s (+9.48%)     23.901s (+41.41%)
> > (5 memcg)               13.388s (+97.94%)   48.460s (+186.71%)

(6 default disabled+memcg)  13.332s (+97.10%)   48.105s (+184.61%)
(7 default enabled+memcg)   13.446s (+98.78%)   54.963s (+225.18%)

(6) shows a bit better performance than (5) but it's probably noise. I
would expect them to be roughly the same. Hope this helps.
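
For anyone who wants to re-derive the percentages, or to compare the memcg
configs against (5) rather than against (1), here is a minimal Python sketch.
It is not part of the patchset or the test harness; the names BASELINE,
RESULTS and overhead_pct are made up for illustration, and the timings are
simply copied from the table above. Because those timings are rounded to
milliseconds, a recomputed value may differ from the posted one in the last
digit.

```python
# Timings in seconds, copied from the table in this thread.
BASELINE = {"kmalloc": 6.764, "pgalloc": 16.902}  # config (1)

RESULTS = {
    "(2) default disabled": {"kmalloc": 6.793, "pgalloc": 17.007},
    "(3) default enabled": {"kmalloc": 7.197, "pgalloc": 23.666},
    "(4) runtime enabled": {"kmalloc": 7.405, "pgalloc": 23.901},
    "(5) memcg": {"kmalloc": 13.388, "pgalloc": 48.460},
    "(6) default disabled+memcg": {"kmalloc": 13.332, "pgalloc": 48.105},
    "(7) default enabled+memcg": {"kmalloc": 13.446, "pgalloc": 54.963},
}


def overhead_pct(measured, reference):
    """Relative slowdown of `measured` versus `reference`, in percent."""
    return (measured - reference) / reference * 100.0


if __name__ == "__main__":
    # Overhead of every config relative to the plain baseline (1).
    for name, times in RESULTS.items():
        cols = ", ".join(
            f"{test} {overhead_pct(t, BASELINE[test]):+.2f}%"
            for test, t in times.items()
        )
        print(f"{name}: {cols}")

    # Overhead of profiling on top of memcg: (7) relative to (5).
    memcg = RESULTS["(5) memcg"]
    for test, t in RESULTS["(7) default enabled+memcg"].items():
        print(f"(7) vs (5) {test}: {overhead_pct(t, memcg[test]):+.2f}%")
```

The second loop swaps the reference from (1) to (5), which isolates the cost
of enabling profiling when CONFIG_MEMCG_KMEM=y is already paying for
__GFP_ACCOUNT.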