On 4/16/24 8:58 PM, Jianfeng Wang wrote:
>
> On 4/15/24 12:35 AM, Vlastimil Babka wrote:
>> On 4/13/24 3:17 AM, Jianfeng Wang wrote:
>>>
>>>> On Apr 12, 2024, at 1:44 PM, Jianfeng Wang <jianfeng.w.wang@xxxxxxxxxx> wrote:
>>>>
>>>> On 4/12/24 1:20 PM, Vlastimil Babka wrote:
>>>>> On 4/12/24 7:29 PM, Jianfeng Wang wrote:
>>>>>>
>>>>>> On 4/12/24 12:48 AM, Vlastimil Babka wrote:
>>>>>>> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>>>>>>>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>>>>>>>
>>>>>>>>> So, the fix is to limit the number of slabs to scan in
>>>>>>>>> count_partial(), and output an approximated result if the list is
>>>>>>>>> too long. Default to 10000, which should be enough for most sane
>>>>>>>>> cases.
>>>>>>>>
>>>>>>>> That is a creative approach. The problem, though, is that objects on
>>>>>>>> the partial lists are kind of sorted. The partial slabs with only a
>>>>>>>> few objects available are at the start of the list so that
>>>>>>>> allocations cause them to be removed from the partial list fast.
>>>>>>>> Full slabs do not need to be tracked on any list.
>>>>>>>>
>>>>>>>> The partial slabs with few objects are put at the end of the partial
>>>>>>>> list in the hope that the few remaining objects will also be freed,
>>>>>>>> which would allow the freeing of the slab folio.
>>>>>>>>
>>>>>>>> So the object density may be higher at the beginning of the list.
>>>>>>>>
>>>>>>>> kmem_cache_shrink() will explicitly sort the partial lists to put
>>>>>>>> the partial pages in that order.
>>>>>>>>
>>>
>>> Realized that I could do "echo 1 > /sys/kernel/slab/dentry/shrink" to
>>> sort the list explicitly. After that, the numbers become:
>>> N = 10000 -> diff = 7.1 %
>>> N = 20000 -> diff = 5.7 %
>>> N = 25000 -> diff = 5.4 %
>>> So, expect a ~5-7% difference after shrinking.
>>>
>>>>>>>> Can you run some tests showing the difference between the estimation
>>>>>>>> and the real count?
>>>>>>
>>>>>> Yes.
>>>>>> On a server with one NUMA node, I created a case that uses many dentry
>>>>>> objects.
>>>>>
>>>>> Could you describe in more detail how you make the dentry cache grow
>>>>> such a large partial list? Thanks.
>>>>>
>>>>
>>>> I utilized the fact that creating a folder creates a new dentry object,
>>>> while deleting a folder deletes all of its sub-folders' dentry objects.
>>>>
>>>> Then I created N folders, each with M empty sub-folders. The assumption
>>>> is that these operations consume a large number of dentry objects in
>>>> sequential order, so their slabs are very likely to be full slabs. After
>>>> all folders were created, I deleted a subset of the N folders (i.e., one
>>>> out of every two folders). This created many holes, which turned a
>>>> subset of the full slabs into partial slabs.
>>
>> Thanks, right, so that's quite a deterministic way to achieve long partial
>> lists with a very close to uniform free/used ratio, so no wonder the
>> resulting accuracy is good and the diff is very small. But in practice,
>> the workloads that may lead to long lists will not be so uniform. The
>> result after shrinking shows what happens if there's a bias in which slabs
>> we inspect due to the sorting. Still, most of the slabs here have the
>> near-uniform free/used ratio, so the sorting does not make much of a
>> difference; another workload might be affected more.
>>
>> So what happens if you inspect X slabs from the head and X from the tail
>> as I suggested?
>> That should help your test case even after you sort, and should in theory
>> also be more accurate for less uniform workloads.
>
> Yes, the approach of counting from both directions and then approximating
> works better after sorting the partial list.

Yeah, I think we could go with that approach then. Let's do 5000 from each
side. You can check whether n->nr_partial < 10000 and then just scan the
whole list in a single direction with no approximation, and otherwise scan
5000 from each side with approximation. I think the code as you show below
will scan some slabs in the middle of the list twice if there are between
5000 and 10000 slabs on the list, so checking n->nr_partial would avoid
that. Thanks!

> Here is what I did.
> ---
> +static unsigned long count_partial(struct kmem_cache_node *n,
> +				   int (*get_count)(struct slab *))
> +{
> +	unsigned long flags;
> +	unsigned long x = 0;
> +	unsigned long scanned = 0;
> +	struct slab *slab;
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +	list_for_each_entry(slab, &n->partial, slab_list) {
> +		x += get_count(slab);
> +		if (++scanned > MAX_PARTIAL_TO_SCAN)
> +			break;
> +	}
> +
> +	if (scanned > MAX_PARTIAL_TO_SCAN) {
> +		scanned = 0;
> +		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
> +			x += get_count(slab);
> +			if (++scanned > MAX_PARTIAL_TO_SCAN) {
> +				/* Approximate total count of objects */
> +				x = mult_frac(x, n->nr_partial, scanned * 2);
> +				x = min(x, node_nr_objs(n));
> +				break;
> +			}
> +		}
> +	}
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +	return x;
> +}
> ---
>
> Results:
> ---
> * Pre-shrink:
> MAX_PARTIAL_TO_SCAN | Diff (single-direction) | Diff (double-direction) |
>                1000 |                  0.43 % |                  0.80 % |
>                5000 |                  0.06 % |                  0.16 % |
>               10000 |                  0.02 % |                -0.003 % |
>               20000 |                 0.009 % |                  0.03 % |
>
> * After-shrink:
> MAX_PARTIAL_TO_SCAN | Diff (single-direction) | Diff (double-direction) |
>                1000 |                 12.46 % |                  3.60 % |
>                5000 |                  5.38 % |                  0.22 % |
>               10000 |                  4.99 % |                 -0.06 % |
>               20000 |                  4.86 % |                 -0.17 % |
> ---
>
> For MAX_PARTIAL_TO_SCAN >= 5000, count_partial() returns the exact object
> count if the partial list is no longer than MAX_PARTIAL_TO_SCAN slabs.
> Otherwise, it counts MAX_PARTIAL_TO_SCAN slabs from the head and
> MAX_PARTIAL_TO_SCAN slabs from the tail, and outputs an approximation
> that shows a <1% difference.
>
> With a slightly larger limit (like 10000), count_partial() should produce
> the exact number for most cases (those that won't lead to a lockup) and
> avoid lockups with a good estimate otherwise.
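For reference, a minimal, untested sketch of the shape discussed above: an
exact walk when n->nr_partial is at most twice the per-side limit, and a
two-ended sample scaled up otherwise. The 5000 per-side limit and the
n->nr_partial shortcut are assumptions taken from this thread, not a final
patch.

---
/*
 * Untested sketch only: exact count for short lists, two-ended sampling
 * with scaling for long ones.
 */
#define MAX_PARTIAL_TO_SCAN	5000

static unsigned long count_partial(struct kmem_cache_node *n,
				   int (*get_count)(struct slab *))
{
	unsigned long flags;
	unsigned long x = 0;
	struct slab *slab;

	spin_lock_irqsave(&n->list_lock, flags);
	if (n->nr_partial <= 2 * MAX_PARTIAL_TO_SCAN) {
		/* Short list: walk it once and return the exact count. */
		list_for_each_entry(slab, &n->partial, slab_list)
			x += get_count(slab);
	} else {
		/*
		 * Long list: sample MAX_PARTIAL_TO_SCAN slabs from each end.
		 * The two walks cannot overlap because nr_partial is greater
		 * than twice the limit, so no slab is counted twice.
		 */
		unsigned long scanned = 0;

		list_for_each_entry(slab, &n->partial, slab_list) {
			x += get_count(slab);
			if (++scanned == MAX_PARTIAL_TO_SCAN)
				break;
		}
		scanned = 0;
		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
			x += get_count(slab);
			if (++scanned == MAX_PARTIAL_TO_SCAN)
				break;
		}
		/* Scale the sampled count up to the whole partial list. */
		x = mult_frac(x, n->nr_partial, 2 * MAX_PARTIAL_TO_SCAN);
		x = min(x, node_nr_objs(n));
	}
	spin_unlock_irqrestore(&n->list_lock, flags);
	return x;
}
---

As in the snippet quoted above, mult_frac() is used for the scaling step so
the intermediate product cannot overflow, and the result is clamped to
node_nr_objs(n).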