On Thu, 16 Apr 2015, Jesper Dangaard Brouer wrote:

> On CPU E5-2630 @ 2.30GHz, the cost of kmem_cache_alloc +
> kmem_cache_free in a tight loop (the most optimal fast-path) is 22ns,
> with elem size 256 bytes, where slab chooses to make 32 obj-per-slab.
>
> With this patch, testing different bulk sizes, the cost of alloc+free
> per element is improved for small bulk sizes (which I guess is the
> expected outcome).
>
> To have something to compare against, I also ran the bulk sizes through
> the fallback versions __kmem_cache_alloc_bulk() and
> __kmem_cache_free_bulk(), i.e. the non-optimized versions.
>
> size    -- optimized -- fallback
> bulk  8 --   15ns    --   22ns
> bulk 16 --   15ns    --   22ns

Good.

> bulk 30 --   44ns    --   48ns
> bulk 32 --   47ns    --   50ns
> bulk 64 --   52ns    --   54ns

Hmm... We are hitting the atomics, I guess. What you have so far only
uses the per-cpu data. I wonder how many partial pages are available
there and how much is satisfied from which per-cpu structure.

There are a couple of cmpxchg_doubles in the optimized patch to squeeze
even the last object out of a page before going to the next. I could
avoid those and simply rotate to another per-cpu partial page instead.
I have some more code here that deals with per-node partials, but at
that point we will be taking spinlocks.

> For smaller bulk sizes 8 and 16, this is actually a significant
> improvement, especially considering the free side is not optimized.

I have some draft code here to do the same for the free side, but I
thought we had better get to some working code on the free side first.
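
For reference, the fallback paths named above amount to looping over the
existing single-object entry points, so every element still pays the full
per-call cost; that is why the optimized path pulls ahead at small bulk
sizes. A minimal sketch of that shape (the exact bodies in the patch set
may differ):

	#include <linux/slab.h>

	/* Sketch of the non-optimized fallbacks: one fast-path call
	 * per object.
	 */
	void __kmem_cache_free_bulk(struct kmem_cache *s, size_t nr, void **p)
	{
		size_t i;

		for (i = 0; i < nr; i++)
			kmem_cache_free(s, p[i]);
	}

	int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
				    size_t nr, void **p)
	{
		size_t i;

		for (i = 0; i < nr; i++) {
			void *x = kmem_cache_alloc(s, flags);

			if (!x) {
				/* Back out partial progress on failure. */
				__kmem_cache_free_bulk(s, i, p);
				return 0;
			}
			p[i] = x;
		}
		return i;
	}

The optimized path instead uses a cmpxchg_double to detach objects from
the per-cpu data in one shot, amortizing the atomic across the whole
request. A rough illustration of the idea, assuming SLUB internals such
as get_freepointer(), next_tid() and the s->cpu_slab layout; this is
illustrative only, not the actual patch:

	void *freelist;
	unsigned long tid;
	size_t i = 0;

	/* Atomically take the entire per-cpu freelist with one
	 * cmpxchg_double, then walk it lock-free to fill p[]. A real
	 * implementation would also put unused objects back.
	 */
	do {
		tid = this_cpu_read(s->cpu_slab->tid);
		freelist = this_cpu_read(s->cpu_slab->freelist);
	} while (freelist &&
		 !this_cpu_cmpxchg_double(s->cpu_slab->freelist,
					  s->cpu_slab->tid,
					  freelist, tid,
					  NULL, next_tid(tid)));

	while (freelist && i < nr) {
		p[i++] = freelist;
		freelist = get_freepointer(s, freelist);
	}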