On 4/12/24 12:48 AM, Vlastimil Babka wrote:
> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>
>>> So, the fix is to limit the number of slabs to scan in
>>> count_partial(), and output an approximated result if the list is too
>>> long. Default to 10000 which should be enough for most sane cases.
>>
>> That is a creative approach. The problem though is that objects on the
>> partial lists are kind of sorted. The partial slabs with only a few
>> objects available are at the start of the list so that allocations cause
>> them to be removed from the partial list fast. Full slabs do not need to
>> be tracked on any list.
>>
>> The partial slabs with few objects are put at the end of the partial list
>> in the hope that the few objects remaining will also be freed which would
>> allow the freeing of the slab folio.
>>
>> So the object density may be higher at the beginning of the list.
>>
>> kmem_cache_shrink() will explicitly sort the partial lists to put the
>> partial pages in that order.
>>
>> Can you run some tests showing the difference between the estimation and
>> the real count?

Yes. On a server with one NUMA node, I create a case that uses many dentry
objects. For "dentry", the length of the partial list is slightly above
250000. Then, I compare my approach of scanning N slabs from the list's head
vs. the original approach of scanning the full list. I do this by computing
both results, with the new and the original count_partial(), and printing
them in /proc/slabinfo.

N = 10000
my_result  = 4741651
org_result = 4744966
diff = (org_result - my_result) / org_result = 0.00069 = 0.069 %

Increasing N further, up to 25000, only slightly improves the accuracy:

N = 15000 -> diff =  0.02 %
N = 20000 -> diff =  0.01 %
N = 25000 -> diff = -0.017 %

Based on these measurements, I think the difference between the estimation
and the real count is very limited (i.e. less than 0.1 % for N = 10000).
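For illustration, here is a minimal user-space sketch of the limited-scan
idea, not the actual kernel patch: `struct slab`, `inuse`, and `max_scan`
are stand-ins, and the extrapolation step is my assumption about how the
approximation could be done (scanned average density applied to the
unscanned remainder of the list).

```c
#include <stddef.h>

struct slab {
	struct slab *next;
	unsigned long inuse;	/* objects allocated from this slab */
};

/*
 * Count objects on at most max_scan slabs from the head of the partial
 * list. If the list (of tracked length nr_partial) is longer, assume
 * the unscanned tail has the same average density as the scanned part.
 */
static unsigned long count_partial_approx(struct slab *head,
					  unsigned long nr_partial,
					  unsigned long max_scan)
{
	unsigned long scanned = 0, counted = 0;
	struct slab *s;

	for (s = head; s && scanned < max_scan; s = s->next, scanned++)
		counted += s->inuse;

	if (scanned && scanned < nr_partial)
		counted += (nr_partial - scanned) * (counted / scanned);

	return counted;
}
```

As Christoph notes above, a head-only sample can over- or under-estimate
when the list is density-sorted, which is what the measurements quantify.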
The benefit is significant: shorter execution time for get_slabinfo(), and
no more soft lockups or crashes caused by count_partial().

> Maybe we could also get a more accurate picture by counting N slabs from the
> head and N from the tail and approximating from both. Also not perfect, but
> could be able to answer the question if the kmem_cache is significantly
> fragmented. Which is probably the only information we can get from the
> slabinfo <active_objs> vs <num_objs>. IIRC the latter is always accurate,
> the former never because of cpu slabs, so we never know how many objects are
> exactly in use. By comparing both we can get an idea of the fragmentation,
> and if this change won't make that estimate significantly worse, it should
> be acceptable.
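A user-space sketch of the head-plus-tail variant suggested above (again
hypothetical, not kernel code; the doubly linked `struct slab` and the
averaging step are my assumptions). Sampling both ends matters because
SLUB keeps denser slabs near the head and sparser ones near the tail, so
a head-only sample can be biased:

```c
#include <stddef.h>

struct slab {
	struct slab *next, *prev;
	unsigned long inuse;	/* objects allocated from this slab */
};

static unsigned long count_partial_both_ends(struct slab *head,
					     struct slab *tail,
					     unsigned long nr_partial,
					     unsigned long n)
{
	unsigned long i, sum = 0;
	struct slab *h = head, *t = tail;

	if (2 * n >= nr_partial) {	/* short list: count exactly */
		for (; h; h = h->next)
			sum += h->inuse;
		return sum;
	}

	for (i = 0; i < n; i++, h = h->next)	/* n slabs from the head */
		sum += h->inuse;
	for (i = 0; i < n; i++, t = t->prev)	/* n slabs from the tail */
		sum += t->inuse;

	/* assume the unscanned middle matches the sampled average */
	return sum + (nr_partial - 2 * n) * (sum / (2 * n));
}
```

Comparing the head-side and tail-side partial sums would also give a rough
fragmentation signal of the kind described above.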