Re: [Patch V3 0/9] Enable memoryless node support for x86

On 08/17/2015 11:18 AM, Jiang Liu wrote:
This is the third version to enable memoryless node support on x86
platforms. The previous version (https://lkml.org/lkml/2014/7/11/75)
blindly replaces numa_node_id()/cpu_to_node() with numa_mem_id()/
cpu_to_mem(). That's not the right solution, as Tejun and Peter pointed
out, because:
1) We shouldn't shift the burden to normal slab users.
2) Details of memoryless nodes should be hidden in arch and mm code
    as much as possible.

After digging into more code and documentation, we found that the rules
for dealing with memoryless nodes should be:
1) Arch code should online the corresponding NUMA node before onlining any
    CPU or memory; otherwise accessing NODE_DATA(nid) may cause an invalid
    memory access.
2) For normal memory allocations without __GFP_THISNODE set in
    gfp_flags, we should prefer numa_node_id()/cpu_to_node() over
    numa_mem_id()/cpu_to_mem() because the latter loses hardware topology
    information, as pointed out by Tejun:
	   A - B - X - C - D
	Where X is the memoryless node.  numa_mem_id() on X would return
	either B or C, right?  If B or C can't satisfy the allocation,
	the allocator would fall back to A from B and to D from C, both of
	which aren't optimal. It should first fall back to C or B
	respectively, which the allocator can't do anymore because the
	information is lost when the caller side performs numa_mem_id().

Hi Liu,

BTW, how is this A - B - X - C - D problem solved?
I don't quite follow this.

I cannot tell the difference between numa_node_id()/cpu_to_node() and
numa_mem_id()/cpu_to_mem() on this point. Even with hardware topology
info, how could it avoid this problem?

Isn't it still possible to fall back to A from B, and to D from C?

Thanks.

3) For memory allocations with __GFP_THISNODE set in gfp_flags,
    numa_node_id()/cpu_to_node() should be used if the caller only wants to
    allocate from local memory; numa_mem_id()/cpu_to_mem() should be used
    if the caller wants to allocate from the nearest node with memory.
4) numa_mem_id()/cpu_to_mem() should be used if the caller wants to check
    whether a page is allocated from the nearest node.
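
The A - B - X - C - D example in rule 2 can be made concrete with a small
userspace model. The sketch below is not kernel code: the five-node line,
the line_has_memory[] table, the hop-count distance, and the
build_fallback_order() helper are all assumptions made up for illustration.
It only shows how a nearest-first fallback order depends on the node it is
computed from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define LINE_NODES 5	/* hypothetical nodes A, B, X, C, D laid out in a line */

/* Assumption for this model: only X (index 2) has no memory. */
static const bool line_has_memory[LINE_NODES] = {
	true, true, false, true, true
};

/* Hop count on the line stands in for NUMA distance in this model. */
static int line_distance(int a, int b)
{
	return abs(a - b);
}

/*
 * Fill order[] with the nodes that have memory, nearest to 'start' first.
 * Ties go to the lower node index. Returns the number of entries written.
 */
static int build_fallback_order(int start, int order[LINE_NODES])
{
	bool used[LINE_NODES] = { false };
	int count = 0;

	assert(start >= 0 && start < LINE_NODES);

	for (;;) {
		int best = -1;

		for (int n = 0; n < LINE_NODES; n++) {
			if (used[n] || !line_has_memory[n])
				continue;
			if (best < 0 ||
			    line_distance(start, n) < line_distance(start, best))
				best = n;
		}
		if (best < 0)
			break;	/* every node with memory has been placed */
		used[best] = true;
		order[count++] = best;
	}
	return count;
}
```

Under these assumptions, calling build_fallback_order() with X (index 2)
gives B, C, A, D, while calling it with B (index 1), which is what a caller
would effectively pass after numa_mem_id(), gives B, A, C, D. In the second
case C is tried only after A, which is the loss of topology information
described in the quoted text.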

Based on the above rules, this patch set does the following:
1) Patch 1 is a bugfix that resolves a crash caused by socket hot-addition.
2) Patch 2 replaces numa_mem_id() with numa_node_id() when __GFP_THISNODE
    isn't set in gfp_flags.
3) Patches 3-6 replace numa_node_id()/cpu_to_node() with numa_mem_id()/
    cpu_to_mem() if the caller wants to allocate from the local node only.
4) Patches 7-9 enable support for memoryless nodes on x86.

With this patch set applied, on a system with two sockets enabled at boot,
one with memory and the other without, we got the following NUMA
topology after boot:
root@bkd04sdp:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
node 0 size: 15940 MB
node 0 free: 15397 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 1 size: 0 MB
node 1 free: 0 MB
node distances:
node   0   1
   0:  10  21
   1:  21  10

After hot-adding the third socket without memory, we got:
root@bkd04sdp:~# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
node 0 size: 15940 MB
node 0 free: 15142 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node distances:
node   0   1   2
   0:  10  21  21
   1:  21  10  21
   2:  21  21  10
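
The distance table above is enough to work out which node a "nearest node
with memory" lookup would pick for each node. The following is a minimal
userspace sketch, not the kernel implementation: nearest_memory_node() and
the two tables are hypothetical names, with values copied from the numactl
output after the hot-add.

```c
#include <assert.h>
#include <limits.h>

#define TABLE_NODES 3

/* Node sizes in MB, copied from the numactl output after the hot-add. */
static const int table_node_size_mb[TABLE_NODES] = { 15940, 0, 0 };

/* Node distances, copied from the same numactl output. */
static const int table_node_distance[TABLE_NODES][TABLE_NODES] = {
	{ 10, 21, 21 },
	{ 21, 10, 21 },
	{ 21, 21, 10 },
};

/* Return the closest node that has memory, or -1 if no node has any. */
static int nearest_memory_node(int nid)
{
	int best = -1;
	int best_dist = INT_MAX;

	assert(nid >= 0 && nid < TABLE_NODES);

	for (int n = 0; n < TABLE_NODES; n++) {
		if (table_node_size_mb[n] == 0)
			continue;	/* skip memoryless nodes */
		if (table_node_distance[nid][n] < best_dist) {
			best = n;
			best_dist = table_node_distance[nid][n];
		}
	}
	return best;
}
```

With these values every node, including the memoryless nodes 1 and 2,
resolves to node 0, the only node reporting a nonzero size.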

Jiang Liu (9):
   x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition
   kernel/profile.c: Replace cpu_to_mem() with cpu_to_node()
   sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless
     node
   openvswitch: Replace cpu_to_node() with cpu_to_mem() to support
     memoryless node
   i40e: Use numa_mem_id() to better support memoryless node
   i40evf: Use numa_mem_id() to better support memoryless node
   x86, numa: Kill useless code to improve code readability
   mm: Update _mem_id_[] for every possible CPU when memory
     configuration changes
   mm, x86: Enable memoryless node support to better support CPU/memory
     hotplug

  arch/x86/Kconfig                              |    3 ++
  arch/x86/kernel/acpi/boot.c                   |    9 +++-
  arch/x86/kernel/smpboot.c                     |    2 +
  arch/x86/mm/numa.c                            |   59 +++++++++++++++----------
  drivers/misc/sgi-xp/xpc_uv.c                  |    2 +-
  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |    2 +-
  drivers/net/ethernet/intel/i40evf/i40e_txrx.c |    2 +-
  kernel/profile.c                              |    2 +-
  mm/page_alloc.c                               |   10 ++---
  net/openvswitch/flow.c                        |    2 +-
  10 files changed, 59 insertions(+), 34 deletions(-)