On Wed, Mar 12, 2025 at 7:58 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 2/22/25 23:46, Suren Baghdasaryan wrote:
> > On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> >>
> >> Specifying a non-zero value for a new struct kmem_cache_args field
> >> sheaf_capacity will set up a caching layer of percpu arrays called
> >> sheaves of the given capacity for the created cache.
> >>
> >> Allocations from the cache will allocate via the percpu sheaves (main or
> >> spare) as long as they have no NUMA node preference. Frees will also
> >> refill one of the sheaves.
> >>
> >> When both percpu sheaves are found empty during an allocation, an empty
> >> sheaf may be replaced with a full one from the per-node barn. If none
> >> are available and the allocation is allowed to block, an empty sheaf is
> >> refilled from slab(s) by an internal bulk alloc operation. When both
> >> percpu sheaves are full during freeing, the barn can replace a full one
> >> with an empty one, unless it is over its limit of full sheaves. In that
> >> case a sheaf is flushed to slab(s) by an internal bulk free operation.
> >> Flushing sheaves and barns is also wired to the existing cpu flushing
> >> and cache shrinking operations.
> >>
> >> The sheaves do not distinguish NUMA locality of the cached objects. If
> >> an allocation is requested with kmem_cache_alloc_node() with a specific
> >> node (not NUMA_NO_NODE), sheaves are bypassed.
> >>
> >> The bulk operations exposed to slab users also try to utilize the
> >> sheaves as long as the necessary (full or empty) sheaves are available
> >> on the cpu or in the barn. Once depleted, they will fall back to bulk
> >> alloc/free to slabs directly to avoid double copying.
> >>
> >> Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
> >> allocated or freed using the sheaves. Counters sheaf_refill,
> >> sheaf_flush_main and sheaf_flush_other count objects filled or flushed
> >> from or to slab pages, and can be used to assess how effective the
> >> caching is. The refill and flush operations will also count towards the
> >> usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
> >> counters.
> >>
> >> Access to the percpu sheaves is protected by local_lock_irqsave()
> >> operations; each per-NUMA-node barn has a spin_lock.
> >>
> >> A current limitation is that when slub_debug is enabled for a cache with
> >> percpu sheaves, the objects in the array are considered as allocated from
> >> the slub_debug perspective, and the alloc/free debugging hooks occur
> >> when moving the objects between the array and slab pages. This means
> >> that e.g. a use-after-free that occurs for an object cached in the
> >> array is undetected. Collected alloc/free stacktraces might also be
> >> less useful. This limitation could be changed in the future.
> >>
> >> On the other hand, KASAN, kmemcg and other hooks are executed on actual
> >> allocations and frees by kmem_cache users even if those use the array,
> >> so their debugging or accounting accuracy should be unaffected.
> >>
> >> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> >
> > Only one possible issue in __pcs_flush_all_cpu(); all other comments
> > are nits and suggestions.
>
> Thanks.
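
One general note first: the description above was enough for me to form
a mental model of the allocation fast path, which I'm pasting below as
pseudocode to confirm I got it right. The function name and
barn_get_full_sheaf() are made up by me, and locking, stats, error
handling and the node-constrained case are omitted:

	static void *alloc_from_pcs_sketch(struct kmem_cache *s, gfp_t gfp)
	{
		struct slub_percpu_sheaves *pcs = this_cpu_ptr(s->cpu_sheaves);

		if (unlikely(pcs->main->size == 0)) {
			struct slab_sheaf *full;

			if (pcs->spare && pcs->spare->size) {
				/* the spare sheaf has objects, make it the main one */
				swap(pcs->main, pcs->spare);
			} else if ((full = barn_get_full_sheaf(pcs->barn))) {
				/* trade the empty main for a full sheaf from the barn */
				barn_put_empty_sheaf(pcs->barn, pcs->main, true);
				pcs->main = full;
			} else if (gfpflags_allow_blocking(gfp)) {
				/* bulk alloc from slabs into the empty main sheaf */
				refill_sheaf(s, pcs->main, gfp);
			} else {
				/* caller falls back to the regular slab paths */
				return NULL;
			}
		}

		return pcs->main->objects[--pcs->main->size];
	}

with the free path being the mirror image of this.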
>
> >> + * Limitations: when slub_debug is enabled for the cache, all relevant
> >> + * actions (i.e. poisoning, obtaining stacktraces) and checks happen
> >> + * when objects move between sheaves and slab pages, which may result in
> >> + * e.g. not detecting a use-after-free while the object is in the array
> >> + * cache, and the stacktraces may be less useful.
> >
> > I would also love to see a short comparison of sheaves (when objects
> > are freed using kfree_rcu()) vs SLAB_TYPESAFE_BY_RCU. I think both
> > mechanisms rcu-free objects in bulk but sheaves would not reuse an
> > object before an RCU grace period has passed. Is that right?
>
> I don't think that's right. SLAB_TYPESAFE_BY_RCU doesn't rcu-free objects
> in bulk; the objects are freed immediately. It only rcu-delays freeing the
> slab folio once all objects are freed.

Yes, you are right.

>
> >> +struct slub_percpu_sheaves {
> >> +	local_lock_t lock;
> >> +	struct slab_sheaf *main; /* never NULL when unlocked */
> >> +	struct slab_sheaf *spare; /* empty or full, may be NULL */
> >> +	struct slab_sheaf *rcu_free;
> >
> > Would be nice to have a short comment for rcu_free as well. I could
> > guess what main and spare are but for rcu_free I had to look further.
>
> Added.
>
> >> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> >> +				   size_t size, void **p);
> >> +
> >> +
> >> +static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> >> +			gfp_t gfp)
> >> +{
> >> +	int to_fill = s->sheaf_capacity - sheaf->size;
> >> +	int filled;
> >> +
> >> +	if (!to_fill)
> >> +		return 0;
> >> +
> >> +	filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
> >> +					 &sheaf->objects[sheaf->size]);
> >> +
> >> +	if (!filled)
> >> +		return -ENOMEM;
> >> +
> >> +	sheaf->size = s->sheaf_capacity;
> >
> > nit: __kmem_cache_alloc_bulk() either allocates the requested number of
> > objects or returns 0, so the current code is fine, but if at some point
> > the implementation changes so that it can return a smaller number of
> > objects than requested (filled < to_fill), then the above assignment
> > will become invalid. I think a safer thing here would be to just:
> >
> > 	sheaf->size += filled;
> >
> > which also makes logical sense. Alternatively you could add
> > VM_BUG_ON(filled != to_fill), but I think the increment would be
> > better.
>
> It's useful to indicate the refill was not successful, for patch 6. So I'm
> changing this to:
>
> 	sheaf->size += filled;
>
> 	stat_add(s, SHEAF_REFILL, filled);
>
> 	if (filled < to_fill)
> 		return -ENOMEM;
>
> 	return 0;

That looks good to me.

>
> >> +
> >> +	stat_add(s, SHEAF_REFILL, filled);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +
> >> +static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
> >> +{
> >> +	struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
> >> +
> >> +	if (!sheaf)
> >> +		return NULL;
> >> +
> >> +	if (refill_sheaf(s, sheaf, gfp)) {
> >> +		free_empty_sheaf(s, sheaf);
> >> +		return NULL;
> >> +	}
> >> +
> >> +	return sheaf;
> >> +}
> >> +
> >> +/*
> >> + * Maximum number of objects freed during a single flush of main pcs sheaf.
> >> + * Translates directly to an on-stack array size.
> >> + */
> >> +#define PCS_BATCH_MAX 32U
> >> +
> >> +static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size,
> >> +				   void **p);
> >> +
> >
> > A comment clarifying why you are freeing in PCS_BATCH_MAX batches here
> > would be helpful. My understanding is that you do that to free objects
> > outside of the cpu_sheaves->lock, so you isolate a batch, release the
> > lock and then free the batch.
>
> OK.
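
In case a concrete wording helps, I was thinking of something along
these lines, e.g. as an extension of the PCS_BATCH_MAX comment (just a
suggestion, feel free to reword):

	/*
	 * Free in batches of up to PCS_BATCH_MAX objects, copied into the
	 * on-stack array, so that cpu_sheaves->lock can be released while
	 * the objects are freed to slabs via __kmem_cache_free_bulk().
	 */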
>
> >> +static void sheaf_flush_main(struct kmem_cache *s)
> >> +{
> >> +	struct slub_percpu_sheaves *pcs;
> >> +	unsigned int batch, remaining;
> >> +	void *objects[PCS_BATCH_MAX];
> >> +	struct slab_sheaf *sheaf;
> >> +	unsigned long flags;
> >> +
> >> +next_batch:
> >> +	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> >> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> >> +	sheaf = pcs->main;
> >> +
> >> +	batch = min(PCS_BATCH_MAX, sheaf->size);
> >> +
> >> +	sheaf->size -= batch;
> >> +	memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
> >> +
> >> +	remaining = sheaf->size;
> >> +
> >> +	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> >> +
> >> +	__kmem_cache_free_bulk(s, batch, &objects[0]);
> >> +
> >> +	stat_add(s, SHEAF_FLUSH_MAIN, batch);
> >> +
> >> +	if (remaining)
> >> +		goto next_batch;
> >> +}
> >> +
> >
> > This function seems to be used against either isolated sheaves or in
> > the slub_cpu_dead() --> __pcs_flush_all_cpu() path where we hold
> > slab_mutex, and I think that guarantees that the sheaf is unused. Maybe
> > a short comment clarifying this requirement, or renaming the function
> > to reflect that? Something like flush_unused_sheaf()?
>
> It's not slab_mutex, but the fact that slub_cpu_dead() is executed in a
> hotplug phase when the given cpu is already not executing anymore and thus
> cannot be manipulating its percpu sheaves, so we are the only ones that do.
> So I will clarify and rename to sheaf_flush_unused().

I see. Thanks for explaining.

>
> >> +
> >> +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
> >> +{
> >> +	struct slub_percpu_sheaves *pcs;
> >> +
> >> +	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> >> +
> >> +	if (pcs->spare) {
> >> +		sheaf_flush(s, pcs->spare);
> >> +		free_empty_sheaf(s, pcs->spare);
> >> +		pcs->spare = NULL;
> >> +	}
> >> +
> >> +	// TODO: handle rcu_free
> >> +	BUG_ON(pcs->rcu_free);
> >> +
> >> +	sheaf_flush_main(s);
> >
> > Hmm. sheaf_flush_main() always flushes for this_cpu only, so IIUC this
> > call will not necessarily flush the main sheaf for the cpu passed to
> > __pcs_flush_all_cpu().
>
> Thanks, yes, I need to call sheaf_flush_unused(pcs->main). It's ok to do
> given my reply above.
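
Ack. Just to check we mean the same thing, with that fix and the rename
applied I'd expect the function to end up looking something like:

	static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
	{
		struct slub_percpu_sheaves *pcs;

		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);

		if (pcs->spare) {
			sheaf_flush_unused(s, pcs->spare);
			free_empty_sheaf(s, pcs->spare);
			pcs->spare = NULL;
		}

		// TODO: handle rcu_free
		BUG_ON(pcs->rcu_free);

		/* flush the given cpu's main sheaf directly, not via this_cpu_ptr() */
		sheaf_flush_unused(s, pcs->main);
	}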
>
> >> +/*
> >> + * Free an object to the percpu sheaves.
> >> + * The object is expected to have passed slab_free_hook() already.
> >> + */
> >> +static __fastpath_inline
> >> +void free_to_pcs(struct kmem_cache *s, void *object)
> >> +{
> >> +	struct slub_percpu_sheaves *pcs;
> >> +	unsigned long flags;
> >> +
> >> +restart:
> >> +	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> >> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> >> +
> >> +	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> >> +
> >> +		struct slab_sheaf *empty;
> >> +
> >> +		if (!pcs->spare) {
> >> +			empty = barn_get_empty_sheaf(pcs->barn);
> >> +			if (empty) {
> >> +				pcs->spare = pcs->main;
> >> +				pcs->main = empty;
> >> +				goto do_free;
> >> +			}
> >> +			goto alloc_empty;
> >> +		}
> >> +
> >> +		if (pcs->spare->size < s->sheaf_capacity) {
> >> +			stat(s, SHEAF_SWAP);
> >> +			swap(pcs->main, pcs->spare);
> >> +			goto do_free;
> >> +		}
> >> +
> >> +		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> >> +
> >> +		if (!IS_ERR(empty)) {
> >> +			pcs->main = empty;
> >> +			goto do_free;
> >> +		}
> >> +
> >> +		if (PTR_ERR(empty) == -E2BIG) {
> >> +			/* Since we got here, spare exists and is full */
> >> +			struct slab_sheaf *to_flush = pcs->spare;
> >> +
> >> +			pcs->spare = NULL;
> >> +			local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> >> +
> >> +			sheaf_flush(s, to_flush);
> >> +			empty = to_flush;
> >> +			goto got_empty;
> >> +		}
> >> +
> >> +alloc_empty:
> >> +		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> >> +
> >> +		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> >> +
> >> +		if (!empty) {
> >> +			sheaf_flush_main(s);
> >> +			goto restart;
> >> +		}
> >> +
> >> +got_empty:
> >> +		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> >> +		pcs = this_cpu_ptr(s->cpu_sheaves);
> >> +
> >> +		/*
> >> +		 * if we put any sheaf to barn here, it's because we raced or
> >> +		 * have been migrated to a different cpu, which should be rare
> >> +		 * enough so just ignore the barn's limits to simplify
> >> +		 */
> >> +		if (unlikely(pcs->main->size < s->sheaf_capacity)) {
> >> +			if (!pcs->spare)
> >> +				pcs->spare = empty;
> >> +			else
> >> +				barn_put_empty_sheaf(pcs->barn, empty, true);
> >> +			goto do_free;
> >> +		}
> >> +
> >> +		if (!pcs->spare) {
> >> +			pcs->spare = pcs->main;
> >> +			pcs->main = empty;
> >> +			goto do_free;
> >> +		}
> >> +
> >> +		barn_put_full_sheaf(pcs->barn, pcs->main, true);
> >> +		pcs->main = empty;
> >
> > I find the program flow in this function quite complex and hard to
> > follow. I think refactoring the above block starting from "pcs =
> > this_cpu_ptr(s->cpu_sheaves)" would somewhat simplify it. That
> > eliminates the need for the "got_empty" label and makes the
> > locking/unlocking sequence of s->cpu_sheaves->lock a bit more clear.
>
> I'm a bit lost, refactoring how exactly?

I meant moving the code above, starting from "pcs =
this_cpu_ptr(s->cpu_sheaves)", into its own function; I think that
would simplify the flow (see the sketch after the quoted code below).
But as I said, it's a nit. If you try it and don't like the result,
feel free to ignore this suggestion.

>
> >> +	}
> >> +
> >> +do_free:
> >> +	pcs->main->objects[pcs->main->size++] = object;
> >> +
> >> +	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> >> +
> >> +	stat(s, FREE_PCS);
> >> +}
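
To be concrete, the extraction I had in mind is roughly the following
(untested, and the helper name is arbitrary): move the block between
re-taking the lock and the do_free label into its own function:

	static void __pcs_install_empty_sheaf(struct kmem_cache *s,
			struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
	{
		/*
		 * if we put any sheaf to barn here, it's because we raced or
		 * have been migrated to a different cpu, which should be rare
		 * enough so just ignore the barn's limits to simplify
		 */
		if (unlikely(pcs->main->size < s->sheaf_capacity)) {
			if (!pcs->spare)
				pcs->spare = empty;
			else
				barn_put_empty_sheaf(pcs->barn, empty, true);
			return;
		}

		if (!pcs->spare) {
			pcs->spare = pcs->main;
			pcs->main = empty;
			return;
		}

		barn_put_full_sheaf(pcs->barn, pcs->main, true);
		pcs->main = empty;
	}

Then both the -E2BIG path and the alloc_empty path would relock, call
the helper and goto do_free, and the got_empty label goes away.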