On 2024/4/2 11:10, Ming Yang wrote:
> When one of the NUMA nodes runs out of memory while lots of processes
> are still booting, slabinfo shows heavy slab fragmentation. The
> following shows some of the affected caches:
>
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> :
> tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs>
> <num_slabs> <sharedavail>
> kmalloc-512  84309 380800 1024 32 8 : tunables 0 0 0 : slabdata 11900 11900 0
> kmalloc-256  65869 365408  512 32 4 : tunables 0 0 0 : slabdata 11419 11419 0
>
> 365408 "kmalloc-256" objects are allocated but only 65869 of them are
> used, while 380800 "kmalloc-512" objects are allocated but only 84309
> of them are used.
>
> This problem occurs in the following scenario:
> 1. Multiple NUMA nodes, e.g. four nodes.
> 2. Lack of memory on any one node.
> 3. Functions which allocate a lot of slab memory on certain NUMA
>    nodes, like alloc_fair_sched_group.
>
> The fragmentation arises for the following reason: in ___slab_alloc()
> a new slab is requested via get_partial(). If the 'node' argument is
> assigned but that node has neither partial slabs nor buddy memory
> left, no slab can be obtained there. The code then falls back to
> allocating a new slab from the buddy system and, as mentioned, since
> the assigned node has no buddy memory left, the slab may be allocated
> directly from the buddy system of another node, regardless of whether
> that other node still has free partial slabs. As a result,
> fragmentation builds up.
>
> The key point of the above allocation flow is: the slab should be
> taken from the partial lists of the other nodes first, instead of
> directly from their buddy systems.
>
> This commit proposes a new slab allocation flow:
> 1. Attempt to get a slab via get_partial() (the first step under the
>    new_objects label).
> 2. If no slab is obtained and 'node' is assigned, try to allocate a
>    new slab from the assigned node only, instead of from all nodes.
> 3. If no slab can be allocated from the assigned node, try to get a
>    slab from the partial lists of the other nodes.
> 4. If the allocation in step 3 fails, allocate a new slab from the
>    buddy system of any node.

FYI, there is another patch to the very same problem:

https://lore.kernel.org/all/20240330082335.29710-1-chenjun102@xxxxxxxxxx/

>
> Signed-off-by: Ming Yang <yangming73@xxxxxxxxxx>
> Signed-off-by: Liang Zhang <zhangliang5@xxxxxxxxxx>
> Signed-off-by: Zhigang Wang <wangzhigang17@xxxxxxxxxx>
> Reviewed-by: Shixin Liu <liushixin2@xxxxxxxxxx>
> ---
> This patch can be tested and verified by the following steps:
> 1. First, run node0 out of memory:
>    echo 1000 (depending on your memory) >
>    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> 2. Second, start 10000 (depending on your memory) processes which use
>    the setsid() system call, as setsid() will likely call
>    alloc_fair_sched_group().
> 3. Last, check slabinfo: cat /proc/slabinfo.
>
> Hardware info:
> Memory : 8GiB
> CPU (total #): 120
> NUMA nodes: 4
>
> Test code example (built with clang):
> #include <stdlib.h>
> #include <unistd.h>
>
> int main(void)
> {
> 	void *p = malloc(1024);
> 	setsid();
> 	while (1)
> 		;
> }
>
> mm/slub.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 1bb2a93cf7..3eb2e7d386 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3522,7 +3522,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	}
>  
>  	slub_put_cpu_ptr(s->cpu_slab);
> +	if (node != NUMA_NO_NODE) {
> +		slab = new_slab(s, gfpflags | __GFP_THISNODE, node);
> +		if (slab)
> +			goto slab_alloced;
> +
> +		slab = get_any_partial(s, &pc);
> +		if (slab)
> +			goto slab_alloced;
> +	}
>  	slab = new_slab(s, gfpflags, node);
> +
> +slab_alloced:
>  	c = slub_get_cpu_ptr(s->cpu_slab);
>  
>  	if (unlikely(!slab)) {