On 8/18/23 14:32, Jesper Dangaard Brouer wrote:
>
> On 15/08/2023 17.53, Matthew Wilcox wrote:
>> On Tue, Aug 15, 2023 at 05:17:36PM +0200, Jesper Dangaard Brouer wrote:
>>> For the bulk API to perform efficiently the slub fragmentation need to
>>> be low. Especially for the SLUB allocator, the efficiency of bulk free
>>> API depend on objects belonging to the same slab (page).
>>
>> Hey Jesper,
>>
>> You probably haven't seen this patch series from Vlastimil:
>>
>> https://lore.kernel.org/linux-mm/20230810163627.6206-9-vbabka@xxxxxxx/
>>
>> I wonder if you'd like to give it a try? It should provide some immunity
>> to this problem, and might even be faster than the current approach.
>> If it isn't, it'd be good to understand why, and if it could be improved.

I didn't Cc Jesper on that yet, as the initial attempt was focused on
the maple tree nodes use case. But you'll notice that using the percpu
array requires the cache to be created with SLAB_NO_MERGE anyway, so
this patch would still be necessary :)

> I took a quick look at:
> - https://lore.kernel.org/linux-mm/20230810163627.6206-11-vbabka@xxxxxxx/#Z31mm:slub.c
>
> To Vlastimil, sorry but I don't think this approach with spin_lock will
> be faster than SLUB's normal fast-path using this_cpu_cmpxchg.
>
> My experience is that SLUB this_cpu_cmpxchg trick is faster than spin_lock.
>
> On my testlab CPU E5-1650 v4 @ 3.60GHz:
> - spin_lock+unlock : 34 cycles(tsc) 9.485 ns
> - this_cpu_cmpxchg :  5 cycles(tsc) 1.585 ns
> - locked cmpxchg   : 18 cycles(tsc) 5.006 ns

Hm, that's an unexpected difference for spin_lock+unlock, where AFAIK
spin_lock is basically a locked cmpxchg and unlock a simple write. And I
assume these measurements are on an uncontended lock?

> SLUB does use a cmpxchg_double which I don't have a microbench for.

Yeah, it's possible the _double will be slower. And yeah, the locking
will have to be considered more thoroughly for the percpu array.

>> No objection to this patch going in for now, of course.
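
To illustrate what I mean by "spin_lock is basically a locked cmpxchg and
unlock a simple write", here is a rough userspace sketch in C11 atomics.
It is not the kernel's spinlock_t (which is a queued spinlock with a more
involved slow path), and all the toy_* names are made up for this example.
It only shows that the uncontended path is one atomic compare-and-exchange
to acquire plus one release store to unlock, which is why I'd naively
expect the cost to be close to the "locked cmpxchg" number.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock for illustration; NOT the kernel's spinlock_t. */
struct toy_spinlock {
	atomic_int locked;	/* 0 = free, 1 = held */
};

static inline void toy_spin_init(struct toy_spinlock *l)
{
	atomic_init(&l->locked, 0);
}

/* One acquisition attempt: a single atomic cmpxchg (LOCK-prefixed on x86). */
static inline bool toy_spin_trylock(struct toy_spinlock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(
		&l->locked, &expected, 1,
		memory_order_acquire, memory_order_relaxed);
}

/* Spin until acquired; the loop body never runs when uncontended. */
static inline void toy_spin_lock(struct toy_spinlock *l)
{
	while (!toy_spin_trylock(l))
		;
}

/* Unlock is a plain store with release ordering, no atomic RMW needed. */
static inline void toy_spin_unlock(struct toy_spinlock *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/*
 * Uncontended lock/unlock round trips around a counter increment.
 * Returns the final counter value, which equals 'iterations'.
 */
static long toy_locked_increments(long iterations)
{
	struct toy_spinlock lock;
	long counter = 0;

	toy_spin_init(&lock);
	for (long i = 0; i < iterations; i++) {
		toy_spin_lock(&lock);
		counter++;
		toy_spin_unlock(&lock);
	}
	return counter;
}
```

Timing a loop like toy_locked_increments() against a loop doing only the
cmpxchg would be one way to cross-check the numbers above in userspace,
with the caveat that it won't capture anything the kernel lock does beyond
this fast path.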