On 4/20/24 2:18 AM, David Rientjes wrote:
> On Fri, 19 Apr 2024, Jianfeng Wang wrote:
> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 1bb2a93cf7b6..993cbbdd2b6c 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3213,6 +3213,43 @@ static inline bool free_debug_processing(struct kmem_cache *s,
>>  #endif /* CONFIG_SLUB_DEBUG */
>>  
>>  #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
>> +#define MAX_PARTIAL_TO_SCAN 10000
>> +
>> +static unsigned long count_partial_free_approx(struct kmem_cache_node *n)
>> +{
>> +	unsigned long flags;
>> +	unsigned long x = 0;
>> +	struct slab *slab;
>> +
>> +	spin_lock_irqsave(&n->list_lock, flags);
>> +	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
>> +		list_for_each_entry(slab, &n->partial, slab_list)
>> +			x += slab->objects - slab->inuse;
>> +	} else {
>> +		/*
>> +		 * For a long list, approximate the total count of objects in
>> +		 * it to meet the limit on the number of slabs to scan.
>> +		 * Scan from both the list's head and tail for better accuracy.
>> +		 */
>> +		unsigned long scanned = 0;
>> +
>> +		list_for_each_entry(slab, &n->partial, slab_list) {
>> +			x += slab->objects - slab->inuse;
>> +			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
>> +				break;
>> +		}
>> +		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
>> +			x += slab->objects - slab->inuse;
>> +			if (++scanned == MAX_PARTIAL_TO_SCAN)
>> +				break;
>> +		}
>> +		x = mult_frac(x, n->nr_partial, scanned);
>> +		x = min(x, node_nr_objs(n));
>> +	}
>> +	spin_unlock_irqrestore(&n->list_lock, flags);
>> +	return x;
>> +}
> 
> Creative :)
> 
> The default value of MAX_PARTIAL_TO_SCAN seems to work well in practice
> while being large enough to bias for actual values?
> 
> I can't think of a better way to avoid the disruption that very long
> partial lists cause.  If the actual value is needed, it will need to be
> read from the sysfs file for that slab cache.
> 
> It does beg the question of whether we want to extend slabinfo to indicate
> that some fields are approximations, however.  Adding a suffix such as
> " : approx" to a slab cache line may be helpful if the disparity in the
> estimates would actually make a difference in practice.

I'm afraid that changing the layout of /proc/slabinfo has a much higher
chance of breaking some consumer than the imprecision due to the
approximation does. So I would rather not risk it.

> I have a hard time believing that this approximation will not be "close
> enough" for all practical purposes, given that the value could very well
> substantially change the instant after the iteration is done anyway.
> 
> So for that reason, this sounds good to me!
> 
> Acked-by: David Rientjes <rientjes@xxxxxxxxxx>
> 
>> +
>>  static unsigned long count_partial(struct kmem_cache_node *n,
>>  				   int (*get_count)(struct slab *))
>>  {
>> @@ -7089,7 +7126,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
>>  	for_each_kmem_cache_node(s, node, n) {
>>  		nr_slabs += node_nr_slabs(n);
>>  		nr_objs += node_nr_objs(n);
>> -		nr_free += count_partial(n, count_free);
>> +		nr_free += count_partial_free_approx(n);
>>  	}
>>  
>>  	sinfo->active_objs = nr_objs - nr_free;
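
For anyone reading along, the extrapolation in the else-branch above is
just a ratio-and-clamp. Here is a minimal userspace sketch (not kernel
code, and the numbers are made up purely for illustration) of what it
computes:

	#include <stdio.h>

	int main(void)
	{
		unsigned long nr_partial = 100000;  /* slabs on the node's partial list */
		unsigned long scanned    = 10000;   /* MAX_PARTIAL_TO_SCAN slabs actually walked */
		unsigned long free_seen  = 25000;   /* free objects counted in the scanned slabs */
		unsigned long node_objs  = 2000000; /* stand-in for node_nr_objs(n) */

		/*
		 * mult_frac(x, nr_partial, scanned) computes x * nr_partial / scanned
		 * (in the kernel it also avoids intermediate overflow).
		 */
		unsigned long est = free_seen * nr_partial / scanned;

		/* clamp to the node's total object count, as min(x, node_nr_objs(n)) does */
		if (est > node_objs)
			est = node_objs;

		printf("estimated free objects: %lu\n", est);  /* prints 250000 */
		return 0;
	}

So with 10000 of 100000 partial slabs scanned and 25000 free objects seen,
the reported estimate is 250000, capped at the node's total object count.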