Re: [PATCH 1/2] stackdepot: use variable size records for non-evictable entries

On Thu, Jan 25, 2024 at 10:48 AM Marco Elver <elver@xxxxxxxxxx> wrote:
>
> With the introduction of stack depot evictions, each stack record is now
> fixed size, so that future reuse after an eviction can safely store
> differently sized stack traces. In all cases that do not make use of
> evictions, this wastes lots of space.
>
> Fix it by re-introducing variable size stack records (up to the max
> allowed size) for entries that will never be evicted. We know if an
> entry will never be evicted if the flag STACK_DEPOT_FLAG_GET is not
> provided, since a later stack_depot_put() attempt is undefined behavior.
>
> With my current kernel config that enables KASAN and also SLUB owner tracking,
> I observe (after a kernel boot) a whopping reduction of 296 stack depot pools,
> which translates into 4736 KiB saved. The savings here are from SLUB owner
> tracking only, because KASAN generic mode still uses refcounting.
>
> Before:
>
>   pools: 893
>   allocations: 29841
>   frees: 6524
>   in_use: 23317
>   freelist_size: 3454
>
> After:
>
>   pools: 597
>   allocations: 29657
>   frees: 6425
>   in_use: 23232
>   freelist_size: 3493
>
> Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces")
> Signed-off-by: Marco Elver <elver@xxxxxxxxxx>
> Cc: Alexander Potapenko <glider@xxxxxxxxxx>
> Cc: Andrey Konovalov <andreyknvl@xxxxxxxxx>
> Cc: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
> ---
> v1 (since RFC):
> * Get rid of new_pool_required to simplify the code.
> * Warn on attempts to switch a non-refcounted entry to refcounting.
> * Typos.
> ---
>  include/linux/poison.h |   3 +
>  lib/stackdepot.c       | 212 +++++++++++++++++++++--------------------
>  2 files changed, 113 insertions(+), 102 deletions(-)
>
> diff --git a/include/linux/poison.h b/include/linux/poison.h
> index 27a7dad17eef..1f0ee2459f2a 100644
> --- a/include/linux/poison.h
> +++ b/include/linux/poison.h
> @@ -92,4 +92,7 @@
>  /********** VFS **********/
>  #define VFS_PTR_POISON ((void *)(0xF5 + POISON_POINTER_DELTA))
>
> +/********** lib/stackdepot.c **********/
> +#define STACK_DEPOT_POISON ((void *)(0xD390 + POISON_POINTER_DELTA))
> +
>  #endif
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index 5caa1f566553..1b0d948a053c 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -22,6 +22,7 @@
>  #include <linux/list.h>
>  #include <linux/mm.h>
>  #include <linux/mutex.h>
> +#include <linux/poison.h>
>  #include <linux/printk.h>
>  #include <linux/rculist.h>
>  #include <linux/rcupdate.h>
> @@ -93,9 +94,6 @@ struct stack_record {
>         };
>  };
>
> -#define DEPOT_STACK_RECORD_SIZE \
> -       ALIGN(sizeof(struct stack_record), 1 << DEPOT_STACK_ALIGN)
> -
>  static bool stack_depot_disabled;
>  static bool __stack_depot_early_init_requested __initdata = IS_ENABLED(CONFIG_STACKDEPOT_ALWAYS_INIT);
>  static bool __stack_depot_early_init_passed __initdata;
> @@ -121,15 +119,10 @@ static void *stack_pools[DEPOT_MAX_POOLS];
>  static void *new_pool;
>  /* Number of pools in stack_pools. */
>  static int pools_num;
> +/* Offset to the unused space in the currently used pool. */
> +static size_t pool_offset = DEPOT_POOL_SIZE;
>  /* Freelist of stack records within stack_pools. */
>  static LIST_HEAD(free_stacks);
> -/*
> - * Stack depot tries to keep an extra pool allocated even before it runs out
> - * of space in the currently used pool. This flag marks whether this extra pool
> - * needs to be allocated. It has the value 0 when either an extra pool is not
> - * yet allocated or if the limit on the number of pools is reached.
> - */
> -static bool new_pool_required = true;
>  /* The lock must be held when performing pool or freelist modifications. */
>  static DEFINE_RAW_SPINLOCK(pool_lock);
>
> @@ -294,48 +287,52 @@ int stack_depot_init(void)
>  EXPORT_SYMBOL_GPL(stack_depot_init);
>
>  /*
> - * Initializes new stack depot @pool, release all its entries to the freelist,
> - * and update the list of pools.
> + * Initializes new stack pool, and updates the list of pools.
>   */
> -static void depot_init_pool(void *pool)
> +static bool depot_init_pool(void **prealloc)
>  {
> -       int offset;
> -
>         lockdep_assert_held(&pool_lock);
>
> -       /* Initialize handles and link stack records into the freelist. */
> -       for (offset = 0; offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
> -            offset += DEPOT_STACK_RECORD_SIZE) {
> -               struct stack_record *stack = pool + offset;
> -
> -               stack->handle.pool_index = pools_num;
> -               stack->handle.offset = offset >> DEPOT_STACK_ALIGN;
> -               stack->handle.extra = 0;
> -
> -               /*
> -                * Stack traces of size 0 are never saved, and we can simply use
> -                * the size field as an indicator if this is a new unused stack
> -                * record in the freelist.
> -                */
> -               stack->size = 0;
> +       if (unlikely(pools_num >= DEPOT_MAX_POOLS)) {
> +               /* Bail out if we reached the pool limit. */
> +               WARN_ON_ONCE(pools_num > DEPOT_MAX_POOLS); /* should never happen */
> +               WARN_ON_ONCE(!new_pool); /* to avoid unnecessary pre-allocation */
> +               WARN_ONCE(1, "Stack depot reached limit capacity");
> +               return false;
> +       }
>
> -               INIT_LIST_HEAD(&stack->hash_list);
> -               /*
> -                * Add to the freelist front to prioritize never-used entries:
> -                * required in case there are entries in the freelist, but their
> -                * RCU cookie still belongs to the current RCU grace period
> -                * (there can still be concurrent readers).
> -                */
> -               list_add(&stack->free_list, &free_stacks);
> -               counters[DEPOT_COUNTER_FREELIST_SIZE]++;
> +       if (!new_pool && *prealloc) {
> +               /* We have preallocated memory, use it. */
> +               WRITE_ONCE(new_pool, *prealloc);
> +               *prealloc = NULL;
>         }
>
> +       if (!new_pool)
> +               return false; /* new_pool and *prealloc are NULL */
> +
>         /* Save reference to the pool to be used by depot_fetch_stack(). */
> -       stack_pools[pools_num] = pool;
> +       stack_pools[pools_num] = new_pool;
> +
> +       /*
> +        * Stack depot tries to keep an extra pool allocated even before it runs
> +        * out of space in the currently used pool.
> +        *
> +        * To indicate that a new preallocation is needed new_pool is reset to
> +        * NULL; do not reset to NULL if we have reached the maximum number of
> +        * pools.
> +        */
> +       if (pools_num < DEPOT_MAX_POOLS)
> +               WRITE_ONCE(new_pool, NULL);
> +       else
> +               WRITE_ONCE(new_pool, STACK_DEPOT_POISON);
>
>         /* Pairs with concurrent READ_ONCE() in depot_fetch_stack(). */
>         WRITE_ONCE(pools_num, pools_num + 1);
>         ASSERT_EXCLUSIVE_WRITER(pools_num);
> +
> +       pool_offset = 0;
> +
> +       return true;
>  }
>
>  /* Keeps the preallocated memory to be used for a new stack depot pool. */
> @@ -347,60 +344,48 @@ static void depot_keep_new_pool(void **prealloc)
>          * If a new pool is already saved or the maximum number of
>          * pools is reached, do not use the preallocated memory.
>          */
> -       if (!new_pool_required)
> +       if (new_pool)
>                 return;
>
> -       /*
> -        * Use the preallocated memory for the new pool
> -        * as long as we do not exceed the maximum number of pools.
> -        */
> -       if (pools_num < DEPOT_MAX_POOLS) {
> -               new_pool = *prealloc;
> -               *prealloc = NULL;
> -       }
> -
> -       /*
> -        * At this point, either a new pool is kept or the maximum
> -        * number of pools is reached. In either case, take note that
> -        * keeping another pool is not required.
> -        */
> -       WRITE_ONCE(new_pool_required, false);
> +       WRITE_ONCE(new_pool, *prealloc);
> +       *prealloc = NULL;
>  }
>
>  /*
> - * Try to initialize a new stack depot pool from either a previous or the
> - * current pre-allocation, and release all its entries to the freelist.
> + * Try to initialize a new stack record from the current pool, a cached pool, or
> + * the current pre-allocation.
>   */
> -static bool depot_try_init_pool(void **prealloc)
> +static struct stack_record *depot_pop_free_pool(void **prealloc, size_t size)
>  {
> +       struct stack_record *stack;
> +       void *current_pool;
> +       u32 pool_index;
> +
>         lockdep_assert_held(&pool_lock);
>
> -       /* Check if we have a new pool saved and use it. */
> -       if (new_pool) {
> -               depot_init_pool(new_pool);
> -               new_pool = NULL;
> +       if (pool_offset + size > DEPOT_POOL_SIZE) {
> +               if (!depot_init_pool(prealloc))
> +                       return NULL;
> +       }
>
> -               /* Take note that we might need a new new_pool. */
> -               if (pools_num < DEPOT_MAX_POOLS)
> -                       WRITE_ONCE(new_pool_required, true);
> +       if (WARN_ON_ONCE(pools_num < 1))
> +               return NULL;
> +       pool_index = pools_num - 1;
> +       current_pool = stack_pools[pool_index];
> +       if (WARN_ON_ONCE(!current_pool))
> +               return NULL;
>
> -               return true;
> -       }
> +       stack = current_pool + pool_offset;
>
> -       /* Bail out if we reached the pool limit. */
> -       if (unlikely(pools_num >= DEPOT_MAX_POOLS)) {
> -               WARN_ONCE(1, "Stack depot reached limit capacity");
> -               return false;
> -       }
> +       /* Pre-initialize handle once. */
> +       stack->handle.pool_index = pool_index;
> +       stack->handle.offset = pool_offset >> DEPOT_STACK_ALIGN;
> +       stack->handle.extra = 0;
> +       INIT_LIST_HEAD(&stack->hash_list);
>
> -       /* Check if we have preallocated memory and use it. */
> -       if (*prealloc) {
> -               depot_init_pool(*prealloc);
> -               *prealloc = NULL;
> -               return true;
> -       }
> +       pool_offset += size;
>
> -       return false;
> +       return stack;
>  }
>
>  /* Try to find next free usable entry. */

Please update this comment to specifically mention the freelist.
Otherwise, it's hard to understand how it differs from
depot_pop_free_pool() without reading the code.
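
Maybe something along these lines (just a wording sketch, adjust as you
see fit):

  /*
   * Try to find the next free usable entry from the freelist: unlike
   * depot_pop_free_pool(), this reuses a previously freed record instead
   * of carving a new one out of the current pool.
   */
  static struct stack_record *depot_pop_free(void)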

> @@ -420,7 +405,7 @@ static struct stack_record *depot_pop_free(void)
>          * check the first entry.
>          */
>         stack = list_first_entry(&free_stacks, struct stack_record, free_list);
> -       if (stack->size && !poll_state_synchronize_rcu(stack->rcu_state))
> +       if (!poll_state_synchronize_rcu(stack->rcu_state))
>                 return NULL;
>
>         list_del(&stack->free_list);
> @@ -429,45 +414,68 @@ static struct stack_record *depot_pop_free(void)
>         return stack;
>  }
>
> +static inline size_t depot_stack_record_size(struct stack_record *s, unsigned int nr_entries)
> +{
> +       const size_t used = flex_array_size(s, entries, nr_entries);
> +       const size_t unused = sizeof(s->entries) - used;
> +
> +       WARN_ON_ONCE(sizeof(s->entries) < used);
> +
> +       return ALIGN(sizeof(struct stack_record) - unused, 1 << DEPOT_STACK_ALIGN);
> +}
> +
>  /* Allocates a new stack in a stack depot pool. */
>  static struct stack_record *
> -depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
> +depot_alloc_stack(unsigned long *entries, int nr_entries, u32 hash, depot_flags_t flags, void **prealloc)
>  {
> -       struct stack_record *stack;
> +       struct stack_record *stack = NULL;
> +       size_t record_size;
>
>         lockdep_assert_held(&pool_lock);
>
>         /* This should already be checked by public API entry points. */
> -       if (WARN_ON_ONCE(!size))
> +       if (WARN_ON_ONCE(!nr_entries))
>                 return NULL;
>
> -       /* Check if we have a stack record to save the stack trace. */
> -       stack = depot_pop_free();
> -       if (!stack) {
> -               /* No usable entries on the freelist - try to refill the freelist. */
> -               if (!depot_try_init_pool(prealloc))
> -                       return NULL;
> +       /* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
> +       if (nr_entries > CONFIG_STACKDEPOT_MAX_FRAMES)
> +               nr_entries = CONFIG_STACKDEPOT_MAX_FRAMES;
> +
> +       if (flags & STACK_DEPOT_FLAG_GET) {
> +               /*
> +                * Evictable entries have to allocate the max. size so they may
> +                * safely be re-used by differently sized allocations.
> +                */
> +               record_size = depot_stack_record_size(stack, CONFIG_STACKDEPOT_MAX_FRAMES);
>                 stack = depot_pop_free();
> -               if (WARN_ON(!stack))
> -                       return NULL;
> +       } else {
> +               record_size = depot_stack_record_size(stack, nr_entries);
>         }
>
> -       /* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
> -       if (size > CONFIG_STACKDEPOT_MAX_FRAMES)
> -               size = CONFIG_STACKDEPOT_MAX_FRAMES;
> +       if (!stack) {
> +               stack = depot_pop_free_pool(prealloc, record_size);
> +               if (!stack)
> +                       return NULL;
> +       }
>
>         /* Save the stack trace. */
>         stack->hash = hash;
> -       stack->size = size;
> -       /* stack->handle is already filled in by depot_init_pool(). */
> -       refcount_set(&stack->count, 1);
> -       memcpy(stack->entries, entries, flex_array_size(stack, entries, size));
> +       stack->size = nr_entries;
> +       /* stack->handle is already filled in by depot_pop_free_pool(). */
> +       memcpy(stack->entries, entries, flex_array_size(stack, entries, nr_entries));
> +
> +       if (flags & STACK_DEPOT_FLAG_GET) {
> +               refcount_set(&stack->count, 1);
> +       } else {
> +               /* Warn on attempts to switch to refcounting this entry. */
> +               refcount_set(&stack->count, REFCOUNT_SATURATED);
> +       }
>
>         /*
>          * Let KMSAN know the stored stack record is initialized. This shall
>          * prevent false positive reports if instrumented code accesses it.
>          */
> -       kmsan_unpoison_memory(stack, DEPOT_STACK_RECORD_SIZE);
> +       kmsan_unpoison_memory(stack, record_size);
>
>         counters[DEPOT_COUNTER_ALLOCS]++;
>         counters[DEPOT_COUNTER_INUSE]++;

I wonder if we should separate the stat counters for the evictable and
non-evictable cases. For non-evictable records, we could count the
amount of memory consumed.
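
If we go that way, something like the following could work (untested,
and the names are purely for illustration):

  enum depot_counter_id {
          /* Refcounted (evictable) records: */
          DEPOT_COUNTER_REFD_ALLOCS,
          DEPOT_COUNTER_REFD_FREES,
          DEPOT_COUNTER_REFD_INUSE,
          DEPOT_COUNTER_FREELIST_SIZE,
          /* Never-evicted records: */
          DEPOT_COUNTER_PERSIST_COUNT,
          DEPOT_COUNTER_PERSIST_BYTES,
          DEPOT_COUNTER_COUNT,
  };

and then in depot_alloc_stack():

  if (flags & STACK_DEPOT_FLAG_GET) {
          refcount_set(&stack->count, 1);
          counters[DEPOT_COUNTER_REFD_ALLOCS]++;
          counters[DEPOT_COUNTER_REFD_INUSE]++;
  } else {
          /* Warn on attempts to switch to refcounting this entry. */
          refcount_set(&stack->count, REFCOUNT_SATURATED);
          counters[DEPOT_COUNTER_PERSIST_COUNT]++;
          counters[DEPOT_COUNTER_PERSIST_BYTES] += record_size;
  }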

> @@ -660,7 +668,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
>          * Allocate memory for a new pool if required now:
>          * we won't be able to do that under the lock.
>          */
> -       if (unlikely(can_alloc && READ_ONCE(new_pool_required))) {
> +       if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
>                 /*
>                  * Zero out zone modifiers, as we don't have specific zone
>                  * requirements. Keep the flags related to allocation in atomic
> @@ -681,7 +689,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
>         found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
>         if (!found) {
>                 struct stack_record *new =
> -                       depot_alloc_stack(entries, nr_entries, hash, &prealloc);
> +                       depot_alloc_stack(entries, nr_entries, hash, depot_flags, &prealloc);
>
>                 if (new) {
>                         /*
> --
> 2.43.0.429.g432eaa2c6b-goog
>

We can now also drop the KMSAN-specific special case for DEPOT_POOLS_CAP.
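
A rough sketch of what I mean, assuming the define still carries the
IS_ENABLED(CONFIG_KMSAN) multiplier on the cap (writing this from
memory, so the exact current form may differ):

  /*
   * With variable-size records, KMSAN no longer needs the extra pool
   * headroom, so the cap can go back to a plain constant.
   */
  #define DEPOT_POOLS_CAP 8192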

Otherwise, looks good to me.

Reviewed-by: Andrey Konovalov <andreyknvl@xxxxxxxxx>

Thank you for cleaning this up!




