On Fri, 3 Jul 2015 09:26:05 +0800 Tang Chen <tangchen@xxxxxxxxxxxxxx> wrote: > > On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote: > > Hi Tang, > > > >> On my box, if I run lscpu, the output looks like this: > >> > >> NUMA node0 CPU(s): 0-14,128-142 > >> NUMA node1 CPU(s): 15-29,143-157 > >> NUMA node2 CPU(s): > >> NUMA node3 CPU(s): > >> NUMA node4 CPU(s): 62-76,190-204 > >> NUMA node5 CPU(s): 78-92,206-220 > >> > >> Node 2 and 3 are not exist, but they are online. > > According your description of patch, node 4 and 5 are mistakenly > > Not node 4 and 5, it is node 2 and 3 which are mistakenly set online. Please add the results of lscpu before/after applyinig the patch into description of your patch. Feel free to add my Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@xxxxxxxxxxxxxx> Thanks, Yasuaki Ishimatsu > > set to online. Why does lscpu show the above result? > > Well, actually not only lscpu gives the strange result, under > /sys/device/system/node, > interfaces for node 2 and 3 are also created. > > I haven't read lscpu code, so I'm not sure how lscpu handles nodes. But > obviously, > node 2 and 3 are set online, which is incorrect. > > For now, I only found that in numa_cleanup_meminfo(), memory above > max_pfn is removed, > but holes between nodes are not removed. > > I think libraries are not able to handle this problem since nodes are > set online in kernel. > Seeing from user space, there is no hole. > > Thanks. > > > > > Thanks, > > Yasuaki Ishimatsu > > > > On Wed, 1 Jul 2015 15:55:30 +0800 > > Tang Chen <tangchen@xxxxxxxxxxxxxx> wrote: > > > >> On 07/01/2015 02:25 PM, Xishi Qiu wrote: > >>> On 2015/7/1 11:16, Tang Chen wrote: > >>> > >>>> When parsing SRAT, all memory ranges are added into numa_meminfo. > >>>> In numa_init(), before entering numa_cleanup_meminfo(), all possible > >>>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes > >>>> all ranges over max_pfn or empty. > >>>> > >>>> But, this only works if the nodes are continuous. Let's have a look > >>>> at the following example: > >>>> > >>>> We have an SRAT like this: > >>>> SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff] > >>>> SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff] > >>>> SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff] > >>>> SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug > >>>> SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug > >>>> SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug > >>>> SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug > >>>> SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug > >>>> SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug > >>>> > >>>> On boot, only node 0,1,2,3 exist. > >>>> > >>>> And the numa_meminfo will look like this: > >>>> numa_meminfo.nr_blks = 9 > >>>> 1. on node 0: [0, 60000000] > >>>> 2. on node 0: [100000000, 20000000000] > >>>> 3. on node 1: [20000000000, 40000000000] > >>>> 4. on node 4: [40000000000, 60000000000] > >>>> 5. on node 5: [60000000000, 80000000000] > >>>> 6. on node 2: [80000000000, a0000000000] > >>>> 7. on node 3: [a0000000000, a0800000000] > >>>> 8. on node 6: [c0000000000, a0800000000] > >>>> 9. on node 7: [e0000000000, a0800000000] > >>>> > >>>> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because > >>>> the end address is over max_pfn, which is a0800000000. But 4 and 5 > >>>> are not removed because their end addresses are less then max_pfn. > >>>> But in fact, node 4 and 5 don't exist. > >>>> > >>>> In a word, numa_cleanup_meminfo() is not able to handle holes between nodes. > >>>> > >>>> Since memory ranges in node 4 and 5 are in numa_meminfo, in numa_register_memblks(), > >>>> node 4 and 5 will be mistakenly set to online. > >>>> > >>>> In this patch, we use memblock_overlaps_region() to check if ranges in > >>>> numa_meminfo overlap with ranges in memory_block. Since memory_block contains > >>>> all available memory at boot time, if they overlap, it means the ranges > >>>> exist. If not, then remove them from numa_meminfo. > >>>> > >>> Hi Tang Chen, > >>> > >>> What's the impact of this problem? > >>> > >>> Command "numactl --hard" will show an empty node(no cpu and no memory, > >>> but pgdat is created), right? > >> On my box, if I run lscpu, the output looks like this: > >> > >> NUMA node0 CPU(s): 0-14,128-142 > >> NUMA node1 CPU(s): 15-29,143-157 > >> NUMA node2 CPU(s): > >> NUMA node3 CPU(s): > >> NUMA node4 CPU(s): 62-76,190-204 > >> NUMA node5 CPU(s): 78-92,206-220 > >> > >> Node 2 and 3 are not exist, but they are online. > >> > >> Thanks. > >> > >>> Thanks, > >>> Xishi Qiu > >>> > >>>> Signed-off-by: Tang Chen <tangchen@xxxxxxxxxxxxxx> > >>>> --- > >>>> arch/x86/mm/numa.c | 6 ++++-- > >>>> include/linux/memblock.h | 2 ++ > >>>> mm/memblock.c | 2 +- > >>>> 3 files changed, 7 insertions(+), 3 deletions(-) > >>>> > >>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c > >>>> index 4053bb5..0c55cc5 100644 > >>>> --- a/arch/x86/mm/numa.c > >>>> +++ b/arch/x86/mm/numa.c > >>>> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi) > >>>> bi->start = max(bi->start, low); > >>>> bi->end = min(bi->end, high); > >>>> > >>>> - /* and there's no empty block */ > >>>> - if (bi->start >= bi->end) > >>>> + /* and there's no empty or non-exist block */ > >>>> + if (bi->start >= bi->end || > >>>> + memblock_overlaps_region(&memblock.memory, > >>>> + bi->start, bi->end - bi->start) == -1) > >>>> numa_remove_memblk_from(i--, mi); > >>>> } > >>>> > >>>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h > >>>> index 0215ffd..3bf6cc1 100644 > >>>> --- a/include/linux/memblock.h > >>>> +++ b/include/linux/memblock.h > >>>> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size); > >>>> int memblock_free(phys_addr_t base, phys_addr_t size); > >>>> int memblock_reserve(phys_addr_t base, phys_addr_t size); > >>>> void memblock_trim_memory(phys_addr_t align); > >>>> +long memblock_overlaps_region(struct memblock_type *type, > >>>> + phys_addr_t base, phys_addr_t size); > >>>> int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size); > >>>> int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size); > >>>> int memblock_mark_mirror(phys_addr_t base, phys_addr_t size); > >>>> diff --git a/mm/memblock.c b/mm/memblock.c > >>>> index 1b444c7..55b5f9f 100644 > >>>> --- a/mm/memblock.c > >>>> +++ b/mm/memblock.c > >>>> @@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p > >>>> return ((base1 < (base2 + size2)) && (base2 < (base1 + size1))); > >>>> } > >>>> > >>>> -static long __init_memblock memblock_overlaps_region(struct memblock_type *type, > >>>> +long __init_memblock memblock_overlaps_region(struct memblock_type *type, > >>>> phys_addr_t base, phys_addr_t size) > >>>> { > >>>> unsigned long i; > >>> > >>> . > >>> > > . > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>