On 2/19/24 10:29, Chengming Zhou wrote:
> On 2024/2/19 16:30, Vlastimil Babka wrote:
>> On 2/18/24 20:25, David Rientjes wrote:
>>> On Thu, 15 Feb 2024, Jianfeng Wang wrote:
>>>
>>>> When reading "/proc/slabinfo", the kernel needs to report the number of
>>>> free objects for each kmem_cache. The current implementation relies on
>>>> count_partial(), which counts the number of free objects by scanning each
>>>> kmem_cache_node's partial slab list and summing the free objects from all
>>>> partial slabs in the list. This process must hold the per kmem_cache_node
>>>> spinlock and disable IRQs. Consequently, it can block slab allocation
>>>> requests on other CPU cores and cause timeouts for network devices etc.,
>>>> if the partial slab list is long. In production, even the NMI watchdog can
>>>> be triggered because some slab caches have a long partial list: e.g.,
>>>> for "buffer_head", the number of partial slabs was observed to be ~1M
>>>> in one kmem_cache_node. This problem was also observed by several
>
> Not sure if this situation is normal? It may be very fragmented, right?
>
> SLUB depends entirely on timing order to place partial slabs on the node
> list, which may be suboptimal in some cases. Maybe we could introduce an
> anti-fragmentation mechanism like the fullness grouping in zsmalloc, to have
> multiple lists based on fullness grouping? Just some random thoughts... :)

Most likely that wouldn't be feasible. When freeing to a slab on the partial
list, it's just a cmpxchg128 (unless the slab becomes empty), and the
additional list manipulation needed to maintain the grouping would kill the
performance.
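
For reference, the cost being discussed comes from the shape of the
count_partial() walk; a simplified sketch of the relevant loop (paraphrased
from mm/slub.c, exact details vary between kernel versions) looks like this:

/*
 * Simplified sketch of count_partial() as found in mm/slub.c; the exact
 * code differs across kernel versions. The whole walk runs with the node's
 * list_lock held and IRQs disabled, so a partial list with ~1M slabs keeps
 * other CPUs that need this lock (e.g. allocations falling back to the
 * node partial list) waiting for the entire scan.
 */
static unsigned long count_partial(struct kmem_cache_node *n,
				   int (*get_count)(struct slab *))
{
	unsigned long flags;
	unsigned long x = 0;
	struct slab *slab;

	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry(slab, &n->partial, slab_list)
		x += get_count(slab);	/* e.g. free objects in this slab */
	spin_unlock_irqrestore(&n->list_lock, flags);

	return x;
}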