On 4/2/24 5:45 AM, Chengming Zhou wrote:
> On 2024/4/2 11:10, Ming Yang wrote:
>> When one of the numa nodes runs out of memory and lots of processes
>> are still booting, slabinfo shows that much slub segmentation exists.
>> The following

You mean fragmentation, not segmentation, right?

>> shows some of them:
>>
>> tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
>> kmalloc-512        84309 380800   1024  32   8 : tunables 0 0 0 : slabdata  11900  11900      0
>> kmalloc-256        65869 365408    512  32   4 : tunables 0 0 0 : slabdata  11419  11419      0
>>
>> 365408 "kmalloc-256" objects are allocated but only 65869 of them are
>> used, while 380800 "kmalloc-512" objects are allocated but only 84309
>> of them are used.
>>
>> This problem exists in the following scenario:
>> 1. Multiple numa nodes, e.g. four nodes.
>> 2. Lack of memory in any one node.
>> 3. Functions which allocate a lot of slab memory on certain numa
>>    nodes, like alloc_fair_sched_group.
>>
>> The slub segmentation is generated for the following reason:
>> in ___slab_alloc() a new slab is first requested via get_partial().
>> If the argument 'node' is assigned but that node has neither partial
>> slabs nor buddy memory left, no slab can be obtained from it. The
>> code then tries to allocate a new slab from the buddy system and, as
>> mentioned before, since no buddy memory is left on the assigned node,
>> the new slab may be allocated from the buddy system of another node
>> directly, regardless of whether free partial slabs are left on that
>> other node. As a result, slub segmentation is generated.
>>
>> The key point of the above allocation flow is: the slab should be
>> allocated from the partial lists of the other nodes first, instead of
>> directly from the buddy system of another node.
>>
>> This commit proposes a new slub allocation flow:
>> 1. Attempt to get a slab via get_partial() (the first step at the
>>    new_objects label).
>> 2. If no slab is gotten and 'node' is assigned, try to allocate a new
>>    slab just from the assigned node instead of from all nodes.
>> 3. If no slab can be allocated from the assigned node, try to get a
>>    slab from the partial lists of the other nodes.
>> 4. If the allocation in step 3 fails, allocate a new slab from the
>>    buddy system of all nodes.
>
> FYI, there is another patch to the very same problem:
>
> https://lore.kernel.org/all/20240330082335.29710-1-chenjun102@xxxxxxxxxx/

Yeah, and I have just taken that one to slab/for-6.10.

>> Signed-off-by: Ming Yang <yangming73@xxxxxxxxxx>
>> Signed-off-by: Liang Zhang <zhangliang5@xxxxxxxxxx>
>> Signed-off-by: Zhigang Wang <wangzhigang17@xxxxxxxxxx>
>> Reviewed-by: Shixin Liu <liushixin2@xxxxxxxxxx>
>> ---
>> This patch can be tested and verified by the following steps:
>> 1. First, try to run out of memory on node0:
>>    echo 1000 (depending on your memory) >
>>      /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>> 2. Second, boot 10000 (depending on your memory) processes which use
>>    the setsid syscall, as setsid will likely call
>>    alloc_fair_sched_group().
>> 3. Last, check slabinfo: cat /proc/slabinfo.
>>
>> Hardware info:
>> Memory : 8GiB
>> CPU (total #): 120
>> numa node: 4
>>
>> Test C code example:
>>
>> #include <stdlib.h>
>> #include <unistd.h>
>>
>> int main(void)
>> {
>> 	void *p = malloc(1024);
>> 	setsid();
>> 	while (1)
>> 		;
>> 	return 0;
>> }
>>
>> mm/slub.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 1bb2a93cf7..3eb2e7d386 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3522,7 +3522,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  	}
>>
>>  	slub_put_cpu_ptr(s->cpu_slab);
>> +	if (node != NUMA_NO_NODE) {
>> +		slab = new_slab(s, gfpflags | __GFP_THISNODE, node);
>> +		if (slab)
>> +			goto slab_alloced;
>> +
>> +		slab = get_any_partial(s, &pc);
>> +		if (slab)
>> +			goto slab_alloced;
>> +	}
>>  	slab = new_slab(s, gfpflags, node);
>> +
>> +slab_alloced:
>>  	c = slub_get_cpu_ptr(s->cpu_slab);
>>
>>  	if (unlikely(!slab)) {