Hi,

This is an RFC to add an opt-in percpu array-based caching layer to SLUB.
The name "sheaf" was invented by Matthew so we don't call it "magazine"
like the original Bonwick paper does. The per-NUMA-node cache of sheaves
is thus called a "barn".

This may seem similar to the arrays in SLAB, but the main differences are:

- opt-in, not used for every cache

- does not distinguish NUMA locality, thus no "alien" arrays that would
  need periodic flushing

- improves kfree_rcu() handling

- API for obtaining a preallocated sheaf that can be used for guaranteed
  and efficient allocations in a restricted context, when the upper bound
  is known but rarely reached

The motivation comes mainly from the ongoing work related to VMA
scalability and the related maple tree operations. This is why the maple
tree node and vma caches are sheaf-enabled in this RFC. Performance
benefits were measured by Suren in preliminary non-public versions.

A sheaf-enabled cache has the following expected advantages:

- Cheaper fast paths. For allocations, instead of a local double
  cmpxchg, with Patch 5 it's just preempt_disable() and no atomic
  operations. The same goes for freeing, which is normally a local
  double cmpxchg only for short-term allocations (i.e. the same slab is
  still active on the same cpu when the object is freed) and a more
  costly locked double cmpxchg otherwise. The downside is the lack of
  NUMA locality guarantees for the allocated objects. I hope this scheme
  will also allow (non-guaranteed) slab allocations in contexts where
  that is impossible today and is instead achieved by building caches on
  top of slab, e.g. the BPF allocator.

- kfree_rcu() batching. kfree_rcu() will put objects into a separate
  percpu sheaf and only submit the whole sheaf to call_rcu() when it is
  full. After the grace period, the sheaf can be reused for allocations,
  which is more efficient than handling individual slab objects (even
  with the batching done by the kfree_rcu() implementation itself).
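To illustrate the fast-path idea described above, here is a minimal
userspace sketch of an array cache with pop/push and a slow-path
fallback. This is only a toy model: in the actual patches the array is
percpu, the fast path runs under preempt_disable() (or the cheaper flag
scheme of Patch 5), and all names below are hypothetical, not the
series' API.

```c
/* Toy userspace model of a "sheaf": a small array cache of objects.
 * Hypothetical names; not the API added by this series. In the kernel
 * the sheaf is percpu, so pop/push need no atomics, only
 * preempt_disable(). */
#include <assert.h>
#include <stdlib.h>

#define SHEAF_CAPACITY 32

struct sheaf {
	unsigned int size;		/* number of cached objects */
	void *objects[SHEAF_CAPACITY];
};

/* Allocation fast path: pop a cached object; fall back to malloc(). */
static void *sheaf_alloc(struct sheaf *s, size_t obj_size)
{
	if (s->size > 0)
		return s->objects[--s->size];	/* no cmpxchg needed */
	return malloc(obj_size);		/* slow path */
}

/* Free fast path: push into the sheaf; fall back to free() when full. */
static void sheaf_free(struct sheaf *s, void *obj)
{
	if (s->size < SHEAF_CAPACITY)
		s->objects[s->size++] = obj;
	else
		free(obj);			/* slow path */
}
```

A short-lived allocation thus becomes an array push followed by an array
pop, with no locked operations on the fast path.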
In case only some cpus are allowed to handle rcu callbacks, the sheaf
can still be made available to other cpus on the same node via the
shared barn. Both the maple_node and vma caches can benefit from this.

- Preallocation support. A prefilled sheaf can be borrowed for a
  short-term operation that is not allowed to block and may need to
  allocate some objects. If an upper bound (worst case) for the number
  of allocations is known, but typically far fewer allocations are
  actually needed, borrowing and returning a sheaf is much more
  efficient than a bulk allocation for the worst case followed by a bulk
  free of the many unused objects. Maple tree write operations should
  benefit from this.

Patch 1 implements the basic sheaf functionality, using
local_lock_irqsave() for percpu sheaf locking. Patch 2 adds the
kfree_rcu() support. Patches 3 and 4 enable sheaves for maple tree nodes
and vma's. Patch 5 replaces the local_lock_irqsave() locking with a
cheaper scheme inspired by online conversations with Mateusz Guzik and
Jann Horn.

In the past I have tried to copy the scheme from the page allocator's
pcplists, which also avoids disabling irqs by using a trylock for
operations that might be attempted from an irq handler context. But the
spin locks used for pcplists are more costly than a simple flag with
only compiler barriers. On the other hand, the lock then cannot be taken
from a different cpu (except for hotplug handling, where the actual
local cpu cannot race with us), but we don't need such remote locking
for sheaves.

Patch 6 implements borrowing of a prefilled sheaf, with maple tree being
the anticipated user once converted to use it by someone more
knowledgeable than myself.

(RFC) LIMITATIONS:

- with slub_debug enabled, objects in sheaves are considered allocated,
  so allocation/free stacktraces may become imprecise and checking of
  e.g. redzone violations may be delayed

- kfree_rcu() via sheaf is only hooked to tree rcu, not tiny rcu.
  Also, in case we fail to allocate a sheaf and fall back to the
  existing implementation, it may use kfree_bulk(), where destructors
  are not hooked. It's however possible that we won't need the
  destructor support at all for now, if vma_lock is moved to the vma
  itself [1] and if it's possible to free anon_name and the numa
  balancing tracking immediately and not after a grace period.

- in case a prefilled sheaf is requested with more objects than the
  cache's sheaf_capacity, it will fail. This should be possible to
  handle by allocating a bigger sheaf and then freeing it when returned,
  to avoid mixing up different sizes. Inefficient, but acceptable if
  very rare.

[1] https://lore.kernel.org/all/20241111205506.3404479-1-surenb@xxxxxxxxxx/

Vlastimil

git branch: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v1r5

---
Vlastimil Babka (6):
      mm/slub: add opt-in caching layer of percpu sheaves
      mm/slub: add sheaf support for batching kfree_rcu() operations
      maple_tree: use percpu sheaves for maple_node_cache
      mm, vma: use sheaves for vm_area_struct cache
      mm, slub: cheaper locking for percpu sheaves
      mm, slub: sheaf prefilling for guaranteed allocations

 include/linux/slab.h |   60 +++
 kernel/fork.c        |   27 +-
 kernel/rcu/tree.c    |    8 +-
 lib/maple_tree.c     |   11 +-
 mm/slab.h            |   27 +
 mm/slab_common.c     |    8 +-
 mm/slub.c            | 1427 ++++++++++++++++++++++++++++++++++++++++++++++++--
 7 files changed, 1503 insertions(+), 65 deletions(-)
---
base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623
change-id: 20231128-slub-percpu-caches-9441892011d7

Best regards,
--
Vlastimil Babka <vbabka@xxxxxxx>