On Wed, Oct 28, 2020 at 11:20:30AM +0530, Bharata B Rao wrote:
> Hi,
>
> On POWER systems, where 64K PAGE_SIZE is the default, I see that slub
> consumes a higher amount of memory compared to any 4K page-size system.
> While slub is obviously going to consume more memory on 64K page-size
> systems compared to 4K, as slabs are allocated in page-size granularity,
> I want to check if there is any obvious tuning (via existing tunables
> or via some code change) that we can do to reduce the amount of memory
> consumed by slub.
>
> Here is a comparison of the slab memory consumption between 4K and
> 64K page-size pseries hash KVM guests with 16 cores and a 16G memory
> configuration immediately after boot:
>
> 64K    209280 kB
> 4K      67636 kB
>
> A 64K configuration may never be able to consume as little as a 4K
> configuration, but this certainly shows that slub can be optimized
> better for the 64K page-size.
>
> slub_max_order
> --------------
> The most promising tunable that shows a consistent reduction in slab
> memory is slub_max_order. Here is a table that shows the number of
> slabs that end up with different orders and the total slab consumption
> at boot for different values of slub_max_order:
>
> -----------------------------------------------------
> slub_max_order    Order    NrSlabs    Slab memory
> -----------------------------------------------------
>       3             0        276      207488 kB
>  (default)          1         16
>                     2          4
>                     3         11
> -----------------------------------------------------
>       2             0        276      166656 kB
>                     1         16
>                     2          4
> -----------------------------------------------------
>       1             0        276      144128 kB
>                     1         31
> -----------------------------------------------------
>
> Though only a few bigger sized caches fall into order-2 or order-3, they
> seem to make a considerable difference to the overall slab consumption.
> If we take the task_struct cache as an example, this is how it ends up
> when slub_max_order is varied:
>
> task_struct, objsize=9856
> --------------------------------------------
> slub_max_order    objperslab    pagesperslab
> --------------------------------------------
>       3               53              8
>       2               26              4
>       1               13              2
> --------------------------------------------
>
> The slab page-order, and hence the number of objects in a slab, has a
> bearing on performance, but I wonder if some caches like task_struct
> above can be auto-tuned to fall into a conservative order and still do
> well with respect to both memory and performance?
>
> mm/slub.c:calculate_order() has the logic which determines the
> page-order for the slab. It starts with min_objects and attempts to
> arrive at the best configuration for the slab. min_objects starts out
> like this:
>
> min_objects = 4 * (fls(nr_cpu_ids) + 1);
>
> Here nr_cpu_ids depends on maxcpus, and hence this can have a
> significant effect on systems which define maxcpus. Slab numbers
> post-boot for a KVM pseries guest that has 16 boot-time CPUs and a
> varying number of maxcpus look like this:
>
> -------------------------------
> maxcpus    Slab memory (kB)
> -------------------------------
>   64           209280
>  256           253824
>  512           293824
> -------------------------------
>
> Page-order is a one-time setting and obviously can't be tweaked
> dynamically on CPU hotplug, but I just wanted to bring out its effect.
>
> And that constant multiplicative factor of 4 was in fact added by
> commit 9b2cd506e5f2 ("slub: Calculate min_objects based on number of
> processors").
>
> Reducing that to, say, 2 does give some reduction in slab memory while
> keeping the same hackbench performance, but I am not sure if that can
> be assumed to be beneficial for all scenarios.
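To make the above arithmetic concrete, here is a small userspace sketch of
that heuristic. It is not the kernel's calculate_order() (which also weighs
per-slab waste); it only picks the smallest order, capped by slub_max_order,
whose slab can hold min_objects objects. The object size (9856, task_struct
from the table above), the CPU counts and the factors 4/2 are taken from the
numbers quoted in this thread.

/*
 * Userspace sketch of the min_objects heuristic discussed above.  This is
 * NOT a copy of mm/slub.c:calculate_order(); it only picks the smallest
 * order, capped by slub_max_order, whose slab can hold min_objects objects.
 */
#include <stdio.h>

static unsigned int fls_u32(unsigned int x)
{
        return x ? 32 - __builtin_clz(x) : 0;
}

static unsigned int pick_order(unsigned long page_size, unsigned int obj_size,
                               unsigned int min_objects, unsigned int max_order)
{
        unsigned int order;

        /* falls back to max_order if no smaller order fits min_objects */
        for (order = 0; order < max_order; order++)
                if ((page_size << order) / obj_size >= min_objects)
                        break;
        return order;
}

int main(void)
{
        const unsigned long page_size = 64 * 1024;      /* 64K pages */
        const unsigned int obj_size = 9856;             /* task_struct above */
        const unsigned int max_order = 3;               /* slub_max_order default */
        const unsigned int cpus[] = { 16, 64, 256, 512 };
        const unsigned int factors[] = { 4, 2 };
        unsigned int i, j;

        for (i = 0; i < 2; i++) {
                for (j = 0; j < 4; j++) {
                        unsigned int min_objects =
                                factors[i] * (fls_u32(cpus[j]) + 1);
                        unsigned int order = pick_order(page_size, obj_size,
                                                        min_objects, max_order);

                        printf("factor=%u nr_cpu_ids=%-3u -> min_objects=%2u, order=%u (%lu objs/slab)\n",
                               factors[i], cpus[j], min_objects, order,
                               (page_size << order) / obj_size);
                }
        }
        return 0;
}

Under these assumptions, factor 4 pushes task_struct to order 3 once
nr_cpu_ids reaches 64, while factor 2 keeps it at order 1 or 2, which is the
direction of the difference described above.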
> MIN_PARTIAL
> -----------
> This determines the number of slabs left on the partial list even if
> they are empty. My initial thought was that the default MIN_PARTIAL
> value of 5 is on the higher side and that we are accumulating
> MIN_PARTIAL empty slabs in all caches without freeing them. However, I
> hardly find a case where an empty slab is retained during freeing on
> account of the partial slabs being fewer than MIN_PARTIAL.
>
> What I do find in practice is that we are accumulating a lot of partial
> slabs with just one in-use object in the whole slab. A high number of
> such partial slabs is indeed contributing to the increased slab memory
> consumption.
>
> For example, after a hackbench run, I find the distribution of objects
> like this for the kmalloc-2k cache:
>
> total_objects                 3168
> objects                       1611
> Nr partial slabs                54
> Nr partial slabs with
> just 1 in-use object            38
>
> With 64K page-size, so many partial slabs with just 1 in-use object can
> result in high memory usage. Is there any workaround possible to
> prevent this kind of situation?
>
> cpu_partial
> -----------
> Here is how the slab consumption post-boot varies when all the slab
> caches are forced to a fixed cpu_partial value:
>
> ---------------------------
> cpu_partial    Slab memory
> ---------------------------
>     0          175872 kB
>     2          187136 kB
>     4          191616 kB
>  default       204864 kB
> ---------------------------
>
> It has been suggested earlier that reducing cpu_partial and/or making
> cpu_partial 64K page-size aware will benefit. In set_cpu_partial(),
> for bigger sized caches (size >= PAGE_SIZE), cpu_partial is already set
> to 2. A bit of tweaking there to introduce cpu_partial=1 for certain
> slabs does give some benefit:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index a28ed9b8fc61..e09eff1199bf 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3626,7 +3626,9 @@ static void set_cpu_partial(struct kmem_cache *s)
>  	 */
>  	if (!kmem_cache_has_cpu_partial(s))
>  		slub_set_cpu_partial(s, 0);
> -	else if (s->size >= PAGE_SIZE)
> +	else if (s->size >= 8192)
> +		slub_set_cpu_partial(s, 1);
> +	else if (s->size >= 4096)
>  		slub_set_cpu_partial(s, 2);
>  	else if (s->size >= 1024)
>  		slub_set_cpu_partial(s, 6);
>
> With the above change, the slab consumption post-boot reduces to
> 186048 kB. Also, here are the hackbench numbers with and without the
> above change:
>
> Average of 10 runs of 'hackbench -s 1024 -l 200 -g 200 -f 25 -P',
> slab consumption captured at the end of each run:
> --------------------------------------------------------------
>                   Time          Slab memory
> --------------------------------------------------------------
> Default           11.124s       645580 kB
> Patched           11.032s       584352 kB
> --------------------------------------------------------------
>
> I have mostly looked at reducing the slab memory consumption here.
> But I do understand that the default tunable values have been arrived
> at based on some benchmark numbers. What I would like to understand
> and explore is whether there are ways to reduce the slub memory
> consumption while keeping the existing level of performance.

Hi Bharata!

I wonder how the distribution of the consumed memory across slab caches
differs between 4k and 64k pages. In particular, I wonder if page-sized
and larger kmallocs make the difference (or a big part of it)? There are
many places in the kernel which are doing something like
kmalloc(PAGE_SIZE).
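One quick way to get that per-cache distribution is to sum up
/proc/slabinfo (available when SLUB is built with CONFIG_SLUB_DEBUG).
A minimal sketch, assuming the usual slabinfo 2.1 format; the footprint
of each cache is approximated as num_slabs * pagesperslab * PAGE_SIZE:

/*
 * Print the approximate memory footprint of every slab cache so that
 * a 4K and a 64K guest can be compared cache by cache.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        long page_size = sysconf(_SC_PAGESIZE);
        char line[512];
        FILE *f = fopen("/proc/slabinfo", "r");

        if (!f) {
                perror("/proc/slabinfo");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                char name[64];
                unsigned long active_objs, num_objs, objsize;
                unsigned long objperslab, pagesperslab;
                unsigned long active_slabs, num_slabs;
                char *p;

                /* skip the version line and the column header */
                if (line[0] == '#' || !strncmp(line, "slabinfo", 8))
                        continue;
                if (sscanf(line, "%63s %lu %lu %lu %lu %lu", name,
                           &active_objs, &num_objs, &objsize,
                           &objperslab, &pagesperslab) != 6)
                        continue;
                p = strstr(line, "slabdata");
                if (!p || sscanf(p, "slabdata %lu %lu",
                                 &active_slabs, &num_slabs) != 2)
                        continue;

                printf("%10lu kB  %s\n",
                       num_slabs * pagesperslab * page_size / 1024, name);
        }
        fclose(f);
        return 0;
}

Piping the output through 'sort -n' gives a per-cache ranking that can
be diffed between the 4K and 64K configurations.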
Re slub tuning: in general we do care about the number of objects on a
partial list, less about the number of pages. If we can have the same
number of objects on fewer pages, it's even better. So I don't see any
reason why we shouldn't scale down these tunables if PAGE_SIZE > 4K.

I don't know if it makes sense to switch to byte-sized tunables or just
to hardcode custom default values for the 64k page case. The latter is
probably easier.

Thanks!
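For the byte-sized-tunables vs. hardcoded-64k-defaults question, here is a
small toy comparison for cpu_partial. The byte thresholds 8192/4096 are just
the ones from the diff in the quoted mail, not a proposal, and sizes below
1024 (which get larger counts upstream) are not modelled; this only shows
how the two schemes diverge for the bigger caches once PAGE_SIZE is 64K.

#include <stdio.h>

/* option (a): threshold relative to the page size, as upstream today */
static unsigned int cpu_partial_pagerel(unsigned long size,
                                        unsigned long page_size)
{
        if (size >= page_size)
                return 2;
        if (size >= 1024)
                return 6;
        return 0;       /* smaller sizes: not modelled */
}

/* option (b): fixed byte thresholds, as in the diff in the quoted mail */
static unsigned int cpu_partial_bytes(unsigned long size)
{
        if (size >= 8192)
                return 1;
        if (size >= 4096)
                return 2;
        if (size >= 1024)
                return 6;
        return 0;       /* smaller sizes: not modelled */
}

int main(void)
{
        const unsigned long sizes[] = { 1024, 2048, 4096, 8192, 16384, 65536 };
        const unsigned long page_sizes[] = { 4096, 65536 };
        unsigned int i, j;

        for (j = 0; j < 2; j++) {
                printf("PAGE_SIZE=%lu:\n", page_sizes[j]);
                for (i = 0; i < 6; i++)
                        printf("  size %6lu: page-relative=%u  byte-threshold=%u\n",
                               sizes[i],
                               cpu_partial_pagerel(sizes[i], page_sizes[j]),
                               cpu_partial_bytes(sizes[i]));
        }
        return 0;
}

On 64K pages the page-relative scheme keeps cpu_partial at 6 for everything
below 64K, which is one way the current defaults end up more generous there;
hardcoding 64k defaults would amount to picking the option (b)-style numbers
only when PAGE_SIZE > 4K.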