On 2/19/24 10:29, Chengming Zhou wrote:
> On 2024/2/19 16:30, Vlastimil Babka wrote:
>> On 2/18/24 20:25, David Rientjes wrote:
>>> On Thu, 15 Feb 2024, Jianfeng Wang wrote:
>>>
>>>> When reading "/proc/slabinfo", the kernel needs to report the number of
>>>> free objects for each kmem_cache. The current implementation relies on
>>>> count_partial(), which counts the number of free objects by scanning each
>>>> kmem_cache_node's partial slab list and summing the free objects from all
>>>> partial slabs in the list. This process must hold the per kmem_cache_node
>>>> spinlock and disable IRQs. Consequently, it can block slab allocation
>>>> requests on other CPU cores and cause timeouts for network devices etc.,
>>>> if the partial slab list is long. In production, even the NMI watchdog can
>>>> be triggered because some slab caches have a long partial list: e.g.,
>>>> for "buffer_head", the number of partial slabs was observed to be ~1M
>>>> in one kmem_cache_node. This problem was also observed by several
>
> Not sure if this situation is normal? It may be very fragmented, right?
>
> SLUB depends entirely on timing order to place partial slabs on the node
> list, which may be suboptimal in some cases. Maybe we could introduce an
> anti-fragmentation mechanism like the fullness grouping in zsmalloc, to have
> multiple lists based on fullness grouping? Just some random thoughts... :)

Most likely that wouldn't be feasible. When freeing to a slab on the partial
list, it's just a cmpxchg128 (unless the slab becomes empty), and the
additional list manipulation needed to maintain the grouping would kill the
performance.
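
For reference, the cost being discussed comes from the shape of the
count_partial() walk; a simplified sketch of the relevant loop (paraphrased
from mm/slub.c, exact details vary between kernel versions) looks like this:

/*
 * Simplified sketch of count_partial() as found in mm/slub.c; the exact
 * code differs across kernel versions. The whole walk runs with the node's
 * list_lock held and IRQs disabled, so a partial list with ~1M slabs keeps
 * other CPUs that need this lock (e.g. allocations falling back to the
 * node partial list) waiting for the entire scan.
 */
static unsigned long count_partial(struct kmem_cache_node *n,
				   int (*get_count)(struct slab *))
{
	unsigned long flags;
	unsigned long x = 0;
	struct slab *slab;

	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry(slab, &n->partial, slab_list)
		x += get_count(slab);	/* e.g. free objects in this slab */
	spin_unlock_irqrestore(&n->list_lock, flags);

	return x;
}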