On 4/11/24 6:40 PM, Jianfeng Wang wrote:
> When reading "/proc/slabinfo", the kernel needs to report the number
> of free objects for each kmem_cache. The current implementation relies
> on count_partial() that counts the number of free objects by scanning
> each kmem_cache_node's list of partial slabs and summing free objects
> from every partial slab in the list. This process must hold per
> kmem_cache_node spinlock and disable IRQ and may take a long time.
> Consequently, it can block slab allocation requests on other CPU cores
> and cause timeouts for network devices etc., when the partial slab list
> is long. In production, even NMI watchdog can be triggered due to this
> matter: e.g., for "buffer_head", the number of partial slabs was
> observed to be ~1M in one kmem_cache_node. This problem was also

Wonder what led to such a situation, whether it could have been the NUMA
related problem this patch is fixing:
https://lore.kernel.org/all/20240330082335.29710-1-chenjun102@xxxxxxxxxx/

But guess we don't do that with buffer heads, and it's just possible that
a surge of allocations led to many slabs being allocated, and then a
slower trickle of freeing didn't happen in a matching order, leading to
many slabs becoming partially free...

> confirmed by several others [1-3] in the past.
>
> Iterating a partial list to get the exact count of objects can cause
> soft lockups for a long list with or without the lock (e.g., if
> preemption is disabled), and is not very useful too: the object
> count can change right after the lock is released. The approach of
> maintaining free-object counters requires atomic operations on the
> fast path [3].
>
> So, the fix is to limit the number of slabs to scan in
> count_partial(), and output an approximated result if the list is too
> long. Default to 10000 which should be enough for most sane cases.
>
> [1] https://lore.kernel.org/linux-mm/alpine.DEB.2.21.2003031602460.1537@xxxxxxxxxxxxxxx/T/
> [2] https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008071258020.55871@xxxxxxxxxxxxxxx/T/
> [3] https://lore.kernel.org/lkml/1e01092b-140d-2bab-aeba-321a74a194ee@xxxxxxxxx/T/
>
> Signed-off-by: Jianfeng Wang <jianfeng.w.wang@xxxxxxxxxx>
> ---
>  mm/slub.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 1bb2a93cf7b6..5ed998ec7d6d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3213,16 +3213,25 @@ static inline bool free_debug_processing(struct kmem_cache *s,
>  #endif /* CONFIG_SLUB_DEBUG */
>  
>  #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
> +#define MAX_PARTIAL_TO_SCAN 10000
> +
>  static unsigned long count_partial(struct kmem_cache_node *n,
>  					int (*get_count)(struct slab *))
>  {
>  	unsigned long flags;
>  	unsigned long x = 0;
> +	unsigned long scanned = 0;
>  	struct slab *slab;
>  
>  	spin_lock_irqsave(&n->list_lock, flags);
> -	list_for_each_entry(slab, &n->partial, slab_list)
> +	list_for_each_entry(slab, &n->partial, slab_list) {
>  		x += get_count(slab);
> +		if (++scanned > MAX_PARTIAL_TO_SCAN) {
> +			/* Approximate total count of objects */
> +			x = mult_frac(x, n->nr_partial, scanned);
> +			break;
> +		}
> +	}
>  	spin_unlock_irqrestore(&n->list_lock, flags);
>  	return x;
>  }
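
For anyone reading along, here is a rough userspace sketch of the
extrapolation the hunk above performs; the estimate_free_objects() helper
and the numbers are made up for illustration and are not part of the patch.
After summing the free objects of the first "scanned" partial slabs, the
total across all n->nr_partial slabs is estimated proportionally, which is
what mult_frac(x, n->nr_partial, scanned) computes while avoiding overflow
in the intermediate product:

#include <stdio.h>

/*
 * Illustrative only: extrapolate the free-object count of a whole partial
 * list from the slabs actually scanned, mirroring what
 * mult_frac(x, n->nr_partial, scanned) does in the hunk above.
 * mult_frac() splits the operands so the product cannot overflow;
 * the same split is done here by hand.
 */
static unsigned long estimate_free_objects(unsigned long counted,
					   unsigned long scanned,
					   unsigned long nr_partial)
{
	unsigned long quot = counted / scanned;
	unsigned long rem  = counted % scanned;

	/* (counted * nr_partial) / scanned without forming the full product */
	return quot * nr_partial + (rem * nr_partial) / scanned;
}

int main(void)
{
	/*
	 * Made-up numbers: 25000 free objects seen in the first 10000 slabs
	 * of a ~1M-slab partial list -> estimated 2500000 free objects.
	 */
	printf("estimated free objects: %lu\n",
	       estimate_free_objects(25000, 10000, 1000000));
	return 0;
}

Note that when the list has at most MAX_PARTIAL_TO_SCAN slabs the break is
never taken, so the exact count is still returned in that case.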