The patch titled Subject: lib/stackdepot: add a refcount field in stack_record has been added to the -mm mm-unstable branch. Its filename is lib-stackdepot-add-a-refcount-field-in-stack_record.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/lib-stackdepot-add-a-refcount-field-in-stack_record.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Oscar Salvador <osalvador@xxxxxxx> Subject: lib/stackdepot: add a refcount field in stack_record Date: Tue, 16 May 2023 20:25:35 +0200 Patch series "page_owner: print stacks and their counter", v5. page_owner is a great debug functionality tool that gets us to know about all pages that have been allocated/freed and their stacktrace. This comes very handy when e.g: debugging leaks, as with some scripting we might be able to see those stacktraces that are allocating pages but not freeing theme. In my experience, that is one of the most useful cases, but it can get really tedious to screen through all pages aand try to reconstruct the stack <-> allocated/freed relationship. There is a lot of noise to cancel off. This patch aims to fix that by adding a new functionality into page_owner. What this does is to create a new read-only file "page_owner_stacks", which prints only the allocating stacktraces and their counting, being that the times the stacktrace has allocated - the times it has freed. So we have a clear overview of stacks <-> allocated/freed relationship without the need to fiddle with pages and trying to match free stacktraces with allocated stacktraces. This is achieved by adding a new refcount_t field in the stack_record struct, incrementing that refcount_t everytime the same stacktrace allocates, and decrementing it when it frees a page. Details can be seen in the respective patches. We also create another file called "page_owner_threshold", which let us specify a threshold, so when when reading from "page_owner_stacks", we will only see those stacktraces which counting goes beyond the threshold we specified. A PoC can be found below: # cat /sys/kernel/debug/page_owner_threshold 0 # cat /sys/kernel/debug/page_owner_stacks > stacks_full.txt # head -32 stacks_full.txt prep_new_page+0x10d/0x180 get_page_from_freelist+0x1bd6/0x1e10 __alloc_pages+0x194/0x360 alloc_page_interleave+0x13/0x90 new_slab+0x31d/0x530 ___slab_alloc+0x5d7/0x720 __slab_alloc.isra.85+0x4a/0x90 kmem_cache_alloc+0x455/0x4a0 acpi_ps_alloc_op+0x57/0x8f acpi_ps_create_scope_op+0x12/0x23 acpi_ps_execute_method+0x102/0x2c1 acpi_ns_evaluate+0x343/0x4da acpi_evaluate_object+0x1cb/0x392 acpi_run_osc+0x135/0x260 acpi_init+0x165/0x4ed do_one_initcall+0x3e/0x200 stack count: 2 free_pcp_prepare+0x287/0x5c0 free_unref_page+0x1c/0xd0 __mmdrop+0x50/0x160 finish_task_switch+0x249/0x2b0 __schedule+0x2c3/0x960 schedule+0x44/0xb0 futex_wait_queue+0x70/0xd0 futex_wait+0x160/0x250 do_futex+0x11c/0x1b0 __x64_sys_futex+0x5e/0x1d0 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd stack count: 1 # echo 10000 > /sys/kernel/debug/page_owner_threshold # cat /sys/kernel/debug/page_owner_stacks > stacks_10000.txt # cat stacks_10000.txt prep_new_page+0x10d/0x180 get_page_from_freelist+0x1bd6/0x1e10 __alloc_pages+0x194/0x360 folio_alloc+0x17/0x40 page_cache_ra_unbounded+0x96/0x170 filemap_get_pages+0x23d/0x5e0 filemap_read+0xbf/0x3a0 __kernel_read+0x136/0x2f0 kernel_read_file+0x197/0x2d0 kernel_read_file_from_fd+0x54/0x90 __do_sys_finit_module+0x89/0x120 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd stack count: 36195 prep_new_page+0x10d/0x180 get_page_from_freelist+0x1bd6/0x1e10 __alloc_pages+0x194/0x360 folio_alloc+0x17/0x40 page_cache_ra_unbounded+0x96/0x170 filemap_get_pages+0x23d/0x5e0 filemap_read+0xbf/0x3a0 new_sync_read+0x106/0x180 vfs_read+0x16f/0x190 ksys_read+0xa5/0xe0 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd stack count: 44484 prep_new_page+0x10d/0x180 get_page_from_freelist+0x1bd6/0x1e10 __alloc_pages+0x194/0x360 folio_alloc+0x17/0x40 page_cache_ra_unbounded+0x96/0x170 filemap_get_pages+0xdd/0x5e0 filemap_read+0xbf/0x3a0 new_sync_read+0x106/0x180 vfs_read+0x16f/0x190 ksys_read+0xa5/0xe0 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd stack count: 17874 This patch (of 3): We want to filter out page_owner output and print only those stacks that have been repeated beyond a certain threshold. This gives us the chance to get rid of a lot of noise. In order to do that, we need to keep track of how many repeated stacks (for allocation) do we have, so we add a new refcount_t field in the stack_record struct. Note that this might increase the size of the struct for some architectures. E.g: x86_64 is not affected due to alignment, but x86 32bits might. The alternative would be to have some kind of struct like this: struct track_stacks { struct stack_record *stack; struct track_stacks *next; refcount_t stack_count; But that would imply to perform more allocations and glue everything together, which would make the code more complex, so I think that going with a new field in the struct stack_record is good enough. Note that on __set_page_owner_handle(), page_owner->handle is set, and on __reset_page_owner(), page_owner->free_handle is set. We are interested in page_owner->handle, so when __set_page_owner() gets called, we derive the stack_record struct from page_owner->handle, and we increment its refcount_t field; and when __reset_page_owner() gets called, we derive its stack_record from page_owner->handle() and we decrement its refcount_t field. Link: https://lkml.kernel.org/r/20230516182537.3139-1-osalvador@xxxxxxx Link: https://lkml.kernel.org/r/20230516182537.3139-2-osalvador@xxxxxxx Signed-off-by: Oscar Salvador <osalvador@xxxxxxx> Cc: Alexander Potapenko <glider@xxxxxxxxxx> Cc: Andrey Konovalov <andreyknvl@xxxxxxxxx> Cc: Eric Dumazet <edumazet@xxxxxxxxxx> Cc: Marco Elver <elver@xxxxxxxxxx> Cc: Michal Hocko <mhocko@xxxxxxxx> Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx> Cc: Vlastimil Babka <vbabka@xxxxxxx> Cc: Waiman Long <longman@xxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- include/linux/stackdepot.h | 2 + lib/stackdepot.c | 53 ++++++++++++++++++++++++++++------- mm/page_owner.c | 6 +++ 3 files changed, 51 insertions(+), 10 deletions(-) --- a/include/linux/stackdepot.h~lib-stackdepot-add-a-refcount-field-in-stack_record +++ a/include/linux/stackdepot.h @@ -94,6 +94,8 @@ static inline int stack_depot_early_init depot_stack_handle_t __stack_depot_save(unsigned long *entries, unsigned int nr_entries, gfp_t gfp_flags, bool can_alloc); +void stack_depot_inc_count(depot_stack_handle_t handle); +void stack_depot_dec_count(depot_stack_handle_t handle); /** * stack_depot_save - Save a stack trace to stack depot --- a/lib/stackdepot.c~lib-stackdepot-add-a-refcount-field-in-stack_record +++ a/lib/stackdepot.c @@ -60,6 +60,7 @@ struct stack_record { u32 hash; /* Hash in the hash table */ u32 size; /* Number of stored frames */ union handle_parts handle; + refcount_t count; /* Number of the same repeated stacks */ unsigned long entries[]; /* Variable-sized array of frames */ }; @@ -305,6 +306,7 @@ depot_alloc_stack(unsigned long *entries stack->handle.offset = pool_offset >> DEPOT_STACK_ALIGN; stack->handle.valid = 1; stack->handle.extra = 0; + refcount_set(&stack->count, 1); memcpy(stack->entries, entries, flex_array_size(stack, entries, size)); pool_offset += required_size; /* @@ -457,8 +459,7 @@ depot_stack_handle_t stack_depot_save(un } EXPORT_SYMBOL_GPL(stack_depot_save); -unsigned int stack_depot_fetch(depot_stack_handle_t handle, - unsigned long **entries) +static struct stack_record *stack_depot_getstack(depot_stack_handle_t handle) { union handle_parts parts = { .handle = handle }; /* @@ -470,6 +471,26 @@ unsigned int stack_depot_fetch(depot_sta size_t offset = parts.offset << DEPOT_STACK_ALIGN; struct stack_record *stack; + if (!handle) + return NULL; + + if (parts.pool_index > pool_index_cached) { + WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n", + parts.pool_index, pool_index_cached, handle); + return NULL; + } + pool = stack_pools[parts.pool_index]; + if (!pool) + return NULL; + stack = pool + offset; + return stack; +} + +unsigned int stack_depot_fetch(depot_stack_handle_t handle, + unsigned long **entries) +{ + struct stack_record *stack; + *entries = NULL; /* * Let KMSAN know *entries is initialized. This shall prevent false @@ -480,21 +501,33 @@ unsigned int stack_depot_fetch(depot_sta if (!handle) return 0; - if (parts.pool_index > pool_index_cached) { - WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n", - parts.pool_index, pool_index_cached, handle); - return 0; - } - pool = stack_pools[parts.pool_index]; - if (!pool) + stack = stack_depot_getstack(handle); + if (!stack) return 0; - stack = pool + offset; *entries = stack->entries; return stack->size; } EXPORT_SYMBOL_GPL(stack_depot_fetch); +void stack_depot_inc_count(depot_stack_handle_t handle) +{ + struct stack_record *stack = NULL; + + stack = stack_depot_getstack(handle); + if (stack) + refcount_inc(&stack->count); +} + +void stack_depot_dec_count(depot_stack_handle_t handle) +{ + struct stack_record *stack = NULL; + + stack = stack_depot_getstack(handle); + if (stack) + refcount_dec(&stack->count); +} + void stack_depot_print(depot_stack_handle_t stack) { unsigned long *entries; --- a/mm/page_owner.c~lib-stackdepot-add-a-refcount-field-in-stack_record +++ a/mm/page_owner.c @@ -139,6 +139,7 @@ void __reset_page_owner(struct page *pag int i; struct page_ext *page_ext; depot_stack_handle_t handle; + depot_stack_handle_t alloc_handle; struct page_owner *page_owner; u64 free_ts_nsec = local_clock(); @@ -146,6 +147,9 @@ void __reset_page_owner(struct page *pag if (unlikely(!page_ext)) return; + page_owner = get_page_owner(page_ext); + alloc_handle = page_owner->handle; + handle = save_stack(GFP_NOWAIT | __GFP_NOWARN); for (i = 0; i < (1 << order); i++) { __clear_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags); @@ -155,6 +159,7 @@ void __reset_page_owner(struct page *pag page_ext = page_ext_next(page_ext); } page_ext_put(page_ext); + stack_depot_dec_count(alloc_handle); } static inline void __set_page_owner_handle(struct page_ext *page_ext, @@ -196,6 +201,7 @@ noinline void __set_page_owner(struct pa return; __set_page_owner_handle(page_ext, handle, order, gfp_mask); page_ext_put(page_ext); + stack_depot_inc_count(handle); } void __set_page_owner_migrate_reason(struct page *page, int reason) _ Patches currently in -mm which might be from osalvador@xxxxxxx are lib-stackdepot-add-a-refcount-field-in-stack_record.patch mm-page_owner-add-page_owner_stacks-file-to-print-out-only-stacks-and-their-counte.patch mmpage_owner-filter-out-stacks-by-a-threshold-counter.patch