On Wed, Oct 28, 2020 at 11:20:30AM +0530, Bharata B Rao wrote:
> Hi,
>
> On POWER systems, where 64K PAGE_SIZE is the default, I see that slub
> consumes a higher amount of memory compared to any 4K page-size system.
> While slub is obviously going to consume more memory on 64K page-size
> systems compared to 4K, as slabs are allocated in page-size granularity,
> I want to check if there is any obvious tuning (via existing tunables
> or via some code change) that we can do to reduce the amount of memory
> consumed by slub.
>
> Here is a comparison of the slab memory consumption between 4K and
> 64K page-size pseries hash KVM guests with 16 cores and a 16G memory
> configuration immediately after boot:
>
> 64K    209280 kB
> 4K      67636 kB
>
> A 64K configuration may never be able to consume as little as a 4K
> configuration, but this certainly shows that slub can be optimized
> better for the 64K page-size.
>
> slub_max_order
> --------------
> The most promising tunable that shows a consistent reduction in slab
> memory is slub_max_order. Here is a table that shows the number of
> slabs that end up with different orders and the total slab consumption
> at boot for different values of slub_max_order:
>
> -----------------------------------------------------
> slub_max_order    Order    NrSlabs    Slab memory
> -----------------------------------------------------
>       3             0        276      207488 kB
>  (default)          1         16
>                     2          4
>                     3         11
> -----------------------------------------------------
>       2             0        276      166656 kB
>                     1         16
>                     2          4
> -----------------------------------------------------
>       1             0        276      144128 kB
>                     1         31
> -----------------------------------------------------
>
> Though only a few bigger sized caches fall into order-2 or order-3, they
> seem to make a considerable difference to the overall slab consumption.
> If we take the task_struct cache as an example, this is how it ends up
> when slub_max_order is varied:
>
> task_struct, objsize=9856
> --------------------------------------------
> slub_max_order    objperslab    pagesperslab
> --------------------------------------------
>       3               53              8
>       2               26              4
>       1               13              2
> --------------------------------------------
>
> The slab page-order, and hence the number of objects in a slab, has a
> bearing on performance, but I wonder if some caches like task_struct
> above can be auto-tuned to fall into a conservative order and still do
> well with respect to both memory and performance?
>
> mm/slub.c:calculate_order() has the logic which determines the
> page-order for the slab. It starts with min_objects and attempts to
> arrive at the best configuration for the slab. min_objects starts out
> like this:
>
> min_objects = 4 * (fls(nr_cpu_ids) + 1);
>
> Here nr_cpu_ids depends on maxcpus, and hence this can have a
> significant effect on systems which define maxcpus. Slab numbers
> post-boot for a KVM pseries guest that has 16 boot-time CPUs and a
> varying number of maxcpus look like this:
>
> -------------------------------
> maxcpus    Slab memory (kB)
> -------------------------------
>   64           209280
>  256           253824
>  512           293824
> -------------------------------
>
> Page-order is a one-time setting and obviously can't be tweaked
> dynamically on CPU hotplug, but I just wanted to bring out its effect.
>
> And that constant multiplicative factor of 4 was in fact added by
> commit 9b2cd506e5f2 ("slub: Calculate min_objects based on number of
> processors").
>
> Reducing that to, say, 2 does give some reduction in slab memory while
> keeping the same hackbench performance, but I am not sure if that can
> be assumed to be beneficial for all scenarios.
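To make the above arithmetic concrete, here is a small userspace sketch of
that heuristic. It is not the kernel's calculate_order() (which also weighs
per-slab waste); it only picks the smallest order, capped by slub_max_order,
whose slab can hold min_objects objects. The object size (9856, task_struct
from the table above), the CPU counts and the factors 4/2 are taken from the
numbers quoted in this thread.

/*
 * Userspace sketch of the min_objects heuristic discussed above.  This is
 * NOT a copy of mm/slub.c:calculate_order(); it only picks the smallest
 * order, capped by slub_max_order, whose slab can hold min_objects objects.
 */
#include <stdio.h>

static unsigned int fls_u32(unsigned int x)
{
        return x ? 32 - __builtin_clz(x) : 0;
}

static unsigned int pick_order(unsigned long page_size, unsigned int obj_size,
                               unsigned int min_objects, unsigned int max_order)
{
        unsigned int order;

        /* falls back to max_order if no smaller order fits min_objects */
        for (order = 0; order < max_order; order++)
                if ((page_size << order) / obj_size >= min_objects)
                        break;
        return order;
}

int main(void)
{
        const unsigned long page_size = 64 * 1024;      /* 64K pages */
        const unsigned int obj_size = 9856;             /* task_struct above */
        const unsigned int max_order = 3;               /* slub_max_order default */
        const unsigned int cpus[] = { 16, 64, 256, 512 };
        const unsigned int factors[] = { 4, 2 };
        unsigned int i, j;

        for (i = 0; i < 2; i++) {
                for (j = 0; j < 4; j++) {
                        unsigned int min_objects =
                                factors[i] * (fls_u32(cpus[j]) + 1);
                        unsigned int order = pick_order(page_size, obj_size,
                                                        min_objects, max_order);

                        printf("factor=%u nr_cpu_ids=%-3u -> min_objects=%2u, order=%u (%lu objs/slab)\n",
                               factors[i], cpus[j], min_objects, order,
                               (page_size << order) / obj_size);
                }
        }
        return 0;
}

Under these assumptions, factor 4 pushes task_struct to order 3 once
nr_cpu_ids reaches 64, while factor 2 keeps it at order 1 or 2, which is the
direction of the difference described above.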
> MIN_PARTIAL
> -----------
> This determines the number of slabs left on the partial list even if
> they are empty. My initial thought was that the default MIN_PARTIAL
> value of 5 is on the higher side and that we are accumulating
> MIN_PARTIAL empty slabs in all caches without freeing them. However, I
> hardly find a case where an empty slab is retained during freeing on
> account of the partial slabs being fewer than MIN_PARTIAL.
>
> What I do find in practice is that we are accumulating a lot of partial
> slabs with just one in-use object in the whole slab. A high number of
> such partial slabs is indeed contributing to the increased slab memory
> consumption.
>
> For example, after a hackbench run, I find the distribution of objects
> like this for the kmalloc-2k cache:
>
> total_objects                 3168
> objects                       1611
> Nr partial slabs                54
> Nr partial slabs with
> just 1 in-use object            38
>
> With 64K page-size, so many partial slabs with just 1 in-use object can
> result in high memory usage. Is there any workaround possible to
> prevent this kind of situation?
>
> cpu_partial
> -----------
> Here is how the slab consumption post-boot varies when all the slab
> caches are forced to a fixed cpu_partial value:
>
> ---------------------------
> cpu_partial    Slab memory
> ---------------------------
>     0          175872 kB
>     2          187136 kB
>     4          191616 kB
>  default       204864 kB
> ---------------------------
>
> It has been suggested earlier that reducing cpu_partial and/or making
> cpu_partial 64K page-size aware will benefit. In set_cpu_partial(),
> for bigger sized caches (size >= PAGE_SIZE), cpu_partial is already set
> to 2. A bit of tweaking there to introduce cpu_partial=1 for certain
> slabs does give some benefit:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index a28ed9b8fc61..e09eff1199bf 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3626,7 +3626,9 @@ static void set_cpu_partial(struct kmem_cache *s)
>  	 */
>  	if (!kmem_cache_has_cpu_partial(s))
>  		slub_set_cpu_partial(s, 0);
> -	else if (s->size >= PAGE_SIZE)
> +	else if (s->size >= 8192)
> +		slub_set_cpu_partial(s, 1);
> +	else if (s->size >= 4096)
>  		slub_set_cpu_partial(s, 2);
>  	else if (s->size >= 1024)
>  		slub_set_cpu_partial(s, 6);
>
> With the above change, the slab consumption post-boot reduces to
> 186048 kB. Also, here are the hackbench numbers with and without the
> above change:
>
> Average of 10 runs of 'hackbench -s 1024 -l 200 -g 200 -f 25 -P',
> slab consumption captured at the end of each run:
> --------------------------------------------------------------
>                   Time          Slab memory
> --------------------------------------------------------------
> Default           11.124s       645580 kB
> Patched           11.032s       584352 kB
> --------------------------------------------------------------
>
> I have mostly looked at reducing the slab memory consumption here.
> But I do understand that the default tunable values have been arrived
> at based on some benchmark numbers. What I would like to understand
> and explore is whether there are ways to reduce the slub memory
> consumption while keeping the existing level of performance.

Hi Bharata!

I wonder how the distribution of the consumed memory across slab caches
differs between 4k and 64k pages. In particular, I wonder if page-sized
and larger kmallocs make the difference (or a big part of it)? There are
many places in the kernel which are doing something like
kmalloc(PAGE_SIZE).
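One quick way to get that per-cache distribution is to sum up
/proc/slabinfo (available when SLUB is built with CONFIG_SLUB_DEBUG).
A minimal sketch, assuming the usual slabinfo 2.1 format; the footprint
of each cache is approximated as num_slabs * pagesperslab * PAGE_SIZE:

/*
 * Print the approximate memory footprint of every slab cache so that
 * a 4K and a 64K guest can be compared cache by cache.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        long page_size = sysconf(_SC_PAGESIZE);
        char line[512];
        FILE *f = fopen("/proc/slabinfo", "r");

        if (!f) {
                perror("/proc/slabinfo");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                char name[64];
                unsigned long active_objs, num_objs, objsize;
                unsigned long objperslab, pagesperslab;
                unsigned long active_slabs, num_slabs;
                char *p;

                /* skip the version line and the column header */
                if (line[0] == '#' || !strncmp(line, "slabinfo", 8))
                        continue;
                if (sscanf(line, "%63s %lu %lu %lu %lu %lu", name,
                           &active_objs, &num_objs, &objsize,
                           &objperslab, &pagesperslab) != 6)
                        continue;
                p = strstr(line, "slabdata");
                if (!p || sscanf(p, "slabdata %lu %lu",
                                 &active_slabs, &num_slabs) != 2)
                        continue;

                printf("%10lu kB  %s\n",
                       num_slabs * pagesperslab * page_size / 1024, name);
        }
        fclose(f);
        return 0;
}

Piping the output through 'sort -n' gives a per-cache ranking that can
be diffed between the 4K and 64K configurations.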
Re slub tuning: in general we do care about the number of objects on a
partial list, less about the number of pages. If we can have the same
number of objects on fewer pages, it's even better. So I don't see any
reason why we shouldn't scale down these tunables if PAGE_SIZE > 4K.

I don't know if it makes sense to switch to byte-sized tunables or just
to hardcode custom default values for the 64k page case. The latter is
probably easier.

Thanks!
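For the byte-sized-tunables vs. hardcoded-64k-defaults question, here is a
small toy comparison for cpu_partial. The byte thresholds 8192/4096 are just
the ones from the diff in the quoted mail, not a proposal, and sizes below
1024 (which get larger counts upstream) are not modelled; this only shows
how the two schemes diverge for the bigger caches once PAGE_SIZE is 64K.

#include <stdio.h>

/* option (a): threshold relative to the page size, as upstream today */
static unsigned int cpu_partial_pagerel(unsigned long size,
                                        unsigned long page_size)
{
        if (size >= page_size)
                return 2;
        if (size >= 1024)
                return 6;
        return 0;       /* smaller sizes: not modelled */
}

/* option (b): fixed byte thresholds, as in the diff in the quoted mail */
static unsigned int cpu_partial_bytes(unsigned long size)
{
        if (size >= 8192)
                return 1;
        if (size >= 4096)
                return 2;
        if (size >= 1024)
                return 6;
        return 0;       /* smaller sizes: not modelled */
}

int main(void)
{
        const unsigned long sizes[] = { 1024, 2048, 4096, 8192, 16384, 65536 };
        const unsigned long page_sizes[] = { 4096, 65536 };
        unsigned int i, j;

        for (j = 0; j < 2; j++) {
                printf("PAGE_SIZE=%lu:\n", page_sizes[j]);
                for (i = 0; i < 6; i++)
                        printf("  size %6lu: page-relative=%u  byte-threshold=%u\n",
                               sizes[i],
                               cpu_partial_pagerel(sizes[i], page_sizes[j]),
                               cpu_partial_bytes(sizes[i]));
        }
        return 0;
}

On 64K pages the page-relative scheme keeps cpu_partial at 6 for everything
below 64K, which is one way the current defaults end up more generous there;
hardcoding 64k defaults would amount to picking the option (b)-style numbers
only when PAGE_SIZE > 4K.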