Re: [GIT PULL] Lockless SLUB slowpaths for v3.1-rc1

On Sun, 31 Jul 2011, Pekka Enberg wrote:

> > And although slub is definitely heading in the right direction regarding 
> > the netperf benchmark, it's still a non-starter for anybody using large 
> > NUMA machines for networking performance.  On my 16-core, 4 node, 64GB 
> > client/server machines running netperf TCP_RR with various thread counts 
> > for 60 seconds each on 3.0:
> > 
> > 	threads		SLUB		SLAB		diff
> > 	 16		76345		74973		- 1.8%
> > 	 32		116380		116272		- 0.1%
> > 	 48		150509		153703		+ 2.1%
> > 	 64		187984		189750		+ 0.9%
> > 	 80		216853		224471		+ 3.5%
> > 	 96		236640		249184		+ 5.3%
> > 	112		256540		275464		+ 7.4%
> > 	128		273027		296014		+ 8.4%
> > 	144		281441		314791		+11.8%
> > 	160		287225		326941		+13.8%
> 
> That looks like a pretty nasty scaling issue. David, would it be
> possible to see 'perf report' for the 160 case? [ Maybe even 'perf
> annotate' for the interesting SLUB functions. ]
> 

More interesting than the perf report (which just shows kfree, 
kmem_cache_free, and kmem_cache_alloc dominating) are the statistics 
exported by slub itself: they show the "slab thrashing" issue that I 
have described several times over the past few years.  It's difficult 
to address because it's a result of slub's design.  From the client 
side of 160 netperf TCP_RR threads for 60 seconds:

	cache		alloc_fastpath		alloc_slowpath
	kmalloc-256	10937512 (62.8%)	6490753
	kmalloc-1024	17121172 (98.3%)	303547
	kmalloc-4096	5526281			11910454 (68.3%)

	cache		free_fastpath		free_slowpath
	kmalloc-256	15469			17412798 (99.9%)
	kmalloc-1024	11604742 (66.6%)	5819973
	kmalloc-4096	14848			17421902 (99.9%)
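
For reference, these counters come from the per-cache statistics files 
under /sys/kernel/slab/, which are present when the kernel is built 
with CONFIG_SLUB_STATS.  A minimal userspace reader, assuming that 
layout, could look something like this:

/*
 * Minimal sketch of collecting the counters above, assuming a kernel
 * built with CONFIG_SLUB_STATS so that each cache directory under
 * /sys/kernel/slab/ exposes per-event files like alloc_fastpath.
 */
#include <stdio.h>

static unsigned long read_stat(const char *cache, const char *stat)
{
	char path[128];
	unsigned long total = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/slab/%s/%s", cache, stat);
	f = fopen(path, "r");
	if (!f)
		return 0;
	/* Each file holds a total followed by per-cpu Cn=count pairs;
	   the leading total is all we need here. */
	if (fscanf(f, "%lu", &total) != 1)
		total = 0;
	fclose(f);
	return total;
}

int main(void)
{
	static const char *caches[] = { "kmalloc-256", "kmalloc-1024",
					"kmalloc-4096" };
	static const char *stats[] = { "alloc_fastpath", "alloc_slowpath",
				       "free_fastpath", "free_slowpath" };
	unsigned int i, j;

	for (i = 0; i < 3; i++) {
		printf("%s:", caches[i]);
		for (j = 0; j < 4; j++)
			printf(" %s=%lu", stats[j],
			       read_stat(caches[i], stats[j]));
		printf("\n");
	}
	return 0;
}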

With those stats, there's no way that slub will ever be able to compete 
with slab, because it's not optimized for the slowpath.  There are ways 
to mitigate that, such as my slab thrashing patchset from a couple of 
years ago, which you tracked for a while and which improved performance 
3-4% at the cost of an extra increment in the fastpath, but every other 
option requires more memory.  You could preallocate slabs on the 
partial list, increase the per-node min_partial, increase the order of 
the slabs themselves so that frees hit the fastpath much more often, 
and so on (a sketch of those knobs follows below), but they all come at 
a considerable cost in memory.
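
As a concrete (if hedged) illustration, the last two of those knobs can 
be poked at runtime through the writable files under /sys/kernel/slab/; 
the values here are arbitrary examples, not tuning advice:

/*
 * Hedged sketch of the memory-for-speed knobs mentioned above, using
 * the writable files under /sys/kernel/slab/.  Values are illustrative
 * only.
 */
#include <stdio.h>

static int set_knob(const char *cache, const char *knob, const char *val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/slab/%s/%s", cache, knob);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Higher order: more objects per slab page, so a free is more
	   likely to hit an object still on the cpu slab (fastpath). */
	set_knob("kmalloc-4096", "order", "3");
	/* Keep more partial slabs cached per node instead of returning
	   empty ones to the page allocator. */
	set_knob("kmalloc-4096", "min_partial", "10");
	return 0;
}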

I'm very confident that slub could beat slab on any system if you throw 
enough memory at it, because its fastpaths are extremely efficient, but 
there's no business case for doing that.


