gather_bootmem_prealloc() assumes a start nid of 0 and a size of
num_node_state(N_MEMORY). NUMA nodes with memory attached can appear at
any node id (in the example below node 0 is memoryless and node 1 holds
all the memory), so walking only the first num_node_state(N_MEMORY) node
ids can miss the nodes that hold the boot-time huge pages. Make
gather_bootmem_prealloc_parallel() cover all online NUMA nodes instead.
Keep max_threads at num_node_state(N_MEMORY) so that the online nodes can
still be distributed roughly uniformly among the parallel threads.

e.g. qemu cmdline
========================
numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
mem_cmd="-object memory-backend-ram,id=mem1,size=16G"

w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
==========================
~ # cat /proc/meminfo |grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:               0 kB

with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
===========================
~ # cat /proc/meminfo |grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       2
HugePages_Free:        2
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:         2097152 kB

Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
Cc: Donet Tom <donettom@xxxxxxxxxxxxx>
Cc: Gang Li <gang.li@xxxxxxxxx>
Cc: Daniel Jordan <daniel.m.jordan@xxxxxxxxxx>
Cc: Muchun Song <muchun.song@xxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: linux-mm@xxxxxxxxx
---

==== Additional data ====

w/o this patch:
================
~ # dmesg |grep -Ei "numa|node|huge"
[    0.000000][    T0] numa: Partition configured for 2 NUMA nodes.
[    0.000000][    T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
[    0.000000][    T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
[    0.000000][    T0] numa: NODE_DATA(0) on node 1
[    0.000000][    T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
[    0.000000][    T0] Movable zone start for each node
[    0.000000][    T0] Early memory node ranges
[    0.000000][    T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
[    0.000000][    T0] Initmem setup node 0 as memoryless
[    0.000000][    T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
[    0.000000][    T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
[    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[    0.000000][    T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
[    0.000000][    T0] Fallback order for Node 0: 1
[    0.000000][    T0] Fallback order for Node 1: 1
[    0.000000][    T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
[    0.044978][    T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    0.209159][    T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
[    0.414281][    T1] smp: Brought up 2 nodes, 4 CPUs
[    0.415268][    T1] numa: Node 0 CPUs: 0-1
[    0.416030][    T1] numa: Node 1 CPUs: 2-3
[   13.644459][   T41] node 1 deferred pages initialised in 12040ms
[   14.241701][    T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
[   14.242781][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
[   14.243806][    T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[   14.244753][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
[   16.490452][    T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[   27.804266][    T1] Demotion targets for Node 1: null

with this patch:
=================
~ # dmesg |grep -Ei "numa|node|huge"
[    0.000000][    T0] numa: Partition configured for 2 NUMA nodes.
[    0.000000][    T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
[    0.000000][    T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
[    0.000000][    T0] numa: NODE_DATA(0) on node 1
[    0.000000][    T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
[    0.000000][    T0] Movable zone start for each node
[    0.000000][    T0] Early memory node ranges
[    0.000000][    T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
[    0.000000][    T0] Initmem setup node 0 as memoryless
[    0.000000][    T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
[    0.000000][    T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
[    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[    0.000000][    T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
[    0.000000][    T0] Fallback order for Node 0: 1
[    0.000000][    T0] Fallback order for Node 1: 1
[    0.000000][    T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
[    0.048825][    T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    0.204211][    T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
[    0.378821][    T1] smp: Brought up 2 nodes, 4 CPUs
[    0.379642][    T1] numa: Node 0 CPUs: 0-1
[    0.380302][    T1] numa: Node 1 CPUs: 2-3
[   11.577527][   T41] node 1 deferred pages initialised in 10250ms
[   12.557856][    T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 2 pages
[   12.574197][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
[   12.576339][    T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[   12.577262][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
[   15.102445][    T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[   26.173888][    T1] Demotion targets for Node 1: null

 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9a3a6e2dee97..60f45314c151 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3443,7 +3443,7 @@ static void __init gather_bootmem_prealloc(void)
 		.thread_fn	= gather_bootmem_prealloc_parallel,
 		.fn_arg		= NULL,
 		.start		= 0,
-		.size		= num_node_state(N_MEMORY),
+		.size		= num_node_state(N_ONLINE),
 		.align		= 1,
 		.min_chunk	= 1,
 		.max_threads	= num_node_state(N_MEMORY),
--
2.39.5