On 15/08/2023 17.53, Matthew Wilcox wrote:
> On Tue, Aug 15, 2023 at 05:17:36PM +0200, Jesper Dangaard Brouer wrote:
>> For the bulk API to perform efficiently, the SLUB fragmentation needs
>> to be low. Especially for the SLUB allocator, the efficiency of the
>> bulk free API depends on the objects belonging to the same slab (page).
> Hey Jesper,
>
> You probably haven't seen this patch series from Vlastimil:
> https://lore.kernel.org/linux-mm/20230810163627.6206-9-vbabka@xxxxxxx/
>
> I wonder if you'd like to give it a try? It should provide some immunity
> to this problem, and might even be faster than the current approach.
> If it isn't, it'd be good to understand why, and if it could be improved.
I took a quick look at:
 - https://lore.kernel.org/linux-mm/20230810163627.6206-11-vbabka@xxxxxxx/#Z31mm:slub.c
To Vlastimil: sorry, but I don't think this approach with a spin_lock will
be faster than SLUB's normal fast-path using this_cpu_cmpxchg.
My experience is that SLUB's this_cpu_cmpxchg trick is faster than a spin_lock.
On my testlab CPU (E5-1650 v4 @ 3.60GHz):
 - spin_lock+unlock :  34 cycles(tsc)   9.485 ns
 - this_cpu_cmpxchg :   5 cycles(tsc)   1.585 ns
 - locked cmpxchg   :  18 cycles(tsc)   5.006 ns
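
For illustration, the gap can be approximated in userspace with a sketch
like the one below. This is *not* the kernel-module microbenchmark behind
the numbers above, just a rough model: on x86-64, this_cpu_cmpxchg() boils
down to a cmpxchg *without* the lock prefix (it only has to be atomic
against the local CPU), which is what the middle case measures.

/* Rough userspace approximation (x86-64, gcc/clang, link with -lpthread).
 * Not the kernel microbenchmark used for the numbers above.
 */
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>
#include <x86intrin.h>        /* __rdtsc() */

#define LOOPS 100000000UL

/* cmpxchg without the "lock" prefix, roughly what this_cpu_cmpxchg()
 * turns into on x86-64 (it only needs to be atomic vs. the local CPU).
 */
static inline uint64_t cmpxchg_nolock(uint64_t *ptr, uint64_t old, uint64_t new)
{
        uint64_t prev = old;

        asm volatile("cmpxchgq %2, %1"
                     : "+a" (prev), "+m" (*ptr)
                     : "r" (new)
                     : "memory", "cc");
        return prev;
}

int main(void)
{
        pthread_spinlock_t lock;
        uint64_t val = 0, start;
        unsigned long i;

        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

        start = __rdtsc();
        for (i = 0; i < LOOPS; i++) {
                pthread_spin_lock(&lock);
                val++;
                pthread_spin_unlock(&lock);
        }
        printf("spin_lock+unlock : %lu cycles(tsc)/op\n",
               (unsigned long)((__rdtsc() - start) / LOOPS));

        start = __rdtsc();
        for (i = 0; i < LOOPS; i++)
                cmpxchg_nolock(&val, val, val + 1);
        printf("cmpxchg, no lock : %lu cycles(tsc)/op\n",
               (unsigned long)((__rdtsc() - start) / LOOPS));

        start = __rdtsc();
        for (i = 0; i < LOOPS; i++)
                __sync_val_compare_and_swap(&val, val, val + 1);
        printf("locked cmpxchg   : %lu cycles(tsc)/op\n",
               (unsigned long)((__rdtsc() - start) / LOOPS));

        return 0;
}

Compile with something like: gcc -O2 -o cmpx_bench cmpx_bench.c -lpthread.
The absolute numbers won't match the kernel-module results above (different
environment, pthread spinlock vs. kernel spin_lock), but the relative
ordering should be the same.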
SLUB does use a cmpxchg_double, which I don't have a microbenchmark for.
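If someone wants a rough feel for that case: as far as I recall, the percpu
variant (this_cpu_cmpxchg_double) on x86-64 is a cmpxchg16b without the lock
prefix, so a hypothetical helper like the one below could be dropped into the
same rdtsc loop as the sketch above.

/* Double-word compare-and-exchange without the "lock" prefix; roughly
 * what a percpu cmpxchg_double costs on x86-64. The target must be
 * 16-byte aligned and the CPU must have cmpxchg16b.
 */
static inline int cmpxchg16b_nolock(__int128 *ptr,
                                    uint64_t old_lo, uint64_t old_hi,
                                    uint64_t new_lo, uint64_t new_hi)
{
        unsigned char ok;

        asm volatile("cmpxchg16b %1\n\t"
                     "sete %0"
                     : "=q" (ok), "+m" (*ptr),
                       "+a" (old_lo), "+d" (old_hi)
                     : "b" (new_lo), "c" (new_hi)
                     : "memory", "cc");
        return ok;
}

/* e.g. benchmarked against:
 * static __int128 target __attribute__((aligned(16)));
 */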
No objection to this patch going in for now, of course.
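
(For context, the bulk API referred to at the top is kmem_cache_alloc_bulk()
/ kmem_cache_free_bulk(); a minimal, made-up usage sketch, not code from any
of the patches:)

#include <linux/slab.h>

#define NR_OBJS 16

static int demo_bulk(struct kmem_cache *cache)
{
        void *objs[NR_OBJS];
        int nr;

        /* All-or-nothing: returns NR_OBJS on success, 0 on failure */
        nr = kmem_cache_alloc_bulk(cache, GFP_KERNEL, NR_OBJS, objs);
        if (!nr)
                return -ENOMEM;

        /* ... use the objects ... */

        /* Freeing the whole batch at once is what lets SLUB group
         * objects that sit in the same slab (page) and free them with
         * few atomic operations -- hence the fragmentation concern
         * quoted at the top.
         */
        kmem_cache_free_bulk(cache, nr, objs);
        return 0;
}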