+ mm-memory_hotplug-do-not-clear-numa_node-association-after-hot_remove.patch added to -mm tree

The patch titled
     Subject: mm, memory_hotplug: do not clear numa_node association after hot_remove
has been added to the -mm tree.  Its filename is
     mm-memory_hotplug-do-not-clear-numa_node-association-after-hot_remove.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-memory_hotplug-do-not-clear-numa_node-association-after-hot_remove.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-memory_hotplug-do-not-clear-numa_node-association-after-hot_remove.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days.

------------------------------------------------------
From: Michal Hocko <mhocko@xxxxxxxx>
Subject: mm, memory_hotplug: do not clear numa_node association after hot_remove

Per-cpu numa_node provides a default node for each possible cpu.  The
association gets initialized during boot, when the architecture-specific
code explores cpu->NUMA affinity.  When the whole NUMA node is removed,
though, we clear this association:

try_offline_node
  check_and_unmap_cpu_on_node
    unmap_cpu_on_node
      numa_clear_node
        numa_set_node(cpu, NUMA_NO_NODE)

This means that whoever calls cpu_to_node for a cpu associated with such a
node will get NUMA_NO_NODE.  This is problematic for two reasons.  First,
it is fragile because __alloc_pages_node would simply blow up on an
out-of-bounds access.  We have encountered this when loading the kvm module

BUG: unable to handle kernel paging request at 00000000000021c0
IP: [<ffffffff8119ccb3>] __alloc_pages_nodemask+0x93/0xb70
PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
Oops: 0000 [#1] SMP
[...]
CPU: 88 PID: 1223749 Comm: modprobe Tainted: G        W          4.4.156-94.64-default #1
task: ffff88727eff1880 ti: ffff887354490000 task.ti: ffff887354490000
RIP: 0010:[<ffffffff8119ccb3>]  [<ffffffff8119ccb3>] __alloc_pages_nodemask+0x93/0xb70
RSP: 0018:ffff887354493b40  EFLAGS: 00010202
RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
FS:  00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 0000000000000086 014000c014d20400 ffff887354493bb8 ffff882614d20f4c
 0000000000000000 0000000000000046 0000000000000046 ffffffff810ac0c9
 ffff88ffe78c0000 ffffffff0000009f ffffe8ffe82d3500 ffff88ff8ac55000
Call Trace:
 [<ffffffffa07476cd>] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
 [<ffffffffa0772c0c>] hardware_setup+0x781/0x849 [kvm_intel]
 [<ffffffffa04a1c58>] kvm_arch_hardware_setup+0x28/0x190 [kvm]
 [<ffffffffa04856fc>] kvm_init+0x7c/0x2d0 [kvm]
 [<ffffffffa0772cf2>] vmx_init+0x1e/0x32c [kvm_intel]
 [<ffffffff8100213a>] do_one_initcall+0xca/0x1f0
 [<ffffffff81193886>] do_init_module+0x5a/0x1d7
 [<ffffffff81112083>] load_module+0x1393/0x1c90
 [<ffffffff81112b30>] SYSC_finit_module+0x70/0xa0
 [<ffffffff8161cbc3>] entry_SYSCALL_64_fastpath+0x1e/0xb7
DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7

on an older kernel, but the code is basically the same in the current Linus
tree as well.  alloc_vmcs_cpu could use alloc_pages_node instead, which
recognizes NUMA_NO_NODE and translates it to numa_mem_id, but that is wrong
as well because numa_mem_id would pick the node of the local CPU, which
might be quite far from the original node.  It is also reasonable to expect
that cpu_to_node will provide a sane value, and there might be many more
callers like that.
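
For reference, here is a simplified, from-memory sketch of the two gfp.h
helpers involved (the real include/linux/gfp.h code has additional debug
checks; only the parts relevant here are shown):

	/*
	 * __alloc_pages_node() trusts its caller: a stale NUMA_NO_NODE (-1)
	 * returned by cpu_to_node() ends up being used as a node index and
	 * oopses as above (VM_BUG_ON is a no-op without CONFIG_DEBUG_VM).
	 */
	static inline struct page *
	__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
	{
		VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
		return __alloc_pages(gfp_mask, order, nid);
	}

	/*
	 * alloc_pages_node() does special-case NUMA_NO_NODE, but it falls
	 * back to the node of the local cpu, not the cpu's original node.
	 */
	static inline struct page *
	alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
	{
		if (nid == NUMA_NO_NODE)
			nid = numa_mem_id();
		return __alloc_pages_node(nid, gfp_mask, order);
	}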

The second problem is that __register_one_node relies on cpu_to_node to
properly associate cpus back to the node when it is onlined again.  We do
not want to lose that link, as there is no arch-independent way to
re-establish it after early boot, AFAICS.
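
Roughly, the relevant part of __register_one_node() (drivers/base/node.c)
looks like the sketch below; once cpu_to_node() has been reset to
NUMA_NO_NODE, the cpu is never linked back under its node:

	/* link cpus back under the node being registered */
	for_each_present_cpu(cpu) {
		if (cpu_to_node(cpu) == nid)
			register_cpu_under_node(cpu, nid);
	}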

Fix both issues by dropping the whole check_and_unmap_cpu_on_node machinery
and keeping the association.  NODE_DATA(nid) is not deallocated, so it will
stay in place, and if anybody wants to allocate from that node then a
fallback node will be used.
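
In other words, with this patch a hypothetical caller doing the usual thing
keeps working even after the whole node has been hot-removed:

	int nid = cpu_to_node(cpu);	/* still the original node, not NUMA_NO_NODE */

	/*
	 * NODE_DATA(nid) is still allocated, so the allocator walks that
	 * node's zonelist and falls back to a node which has memory.
	 */
	struct page *page = __alloc_pages_node(nid, GFP_KERNEL, 0);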

Thanks to Vlastimil Babka for his live system debugging skills that helped
debug the issue.

Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@xxxxxxxxxx
Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node")
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
Debugged-by: Vlastimil Babka <vbabka@xxxxxxx>
Reported-by: Miroslav Benes <mbenes@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/memory_hotplug.c |   30 +-----------------------------
 1 file changed, 1 insertion(+), 29 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-do-not-clear-numa_node-association-after-hot_remove
+++ a/mm/memory_hotplug.c
@@ -1763,34 +1763,6 @@ static int check_cpu_on_node(pg_data_t *
 	return 0;
 }
 
-static void unmap_cpu_on_node(pg_data_t *pgdat)
-{
-#ifdef CONFIG_ACPI_NUMA
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		if (cpu_to_node(cpu) == pgdat->node_id)
-			numa_clear_node(cpu);
-#endif
-}
-
-static int check_and_unmap_cpu_on_node(pg_data_t *pgdat)
-{
-	int ret;
-
-	ret = check_cpu_on_node(pgdat);
-	if (ret)
-		return ret;
-
-	/*
-	 * the node will be offlined when we come here, so we can clear
-	 * the cpu_to_node() now.
-	 */
-
-	unmap_cpu_on_node(pgdat);
-	return 0;
-}
-
 /**
  * try_offline_node
  * @nid: the node ID
@@ -1823,7 +1795,7 @@ void try_offline_node(int nid)
 		return;
 	}
 
-	if (check_and_unmap_cpu_on_node(pgdat))
+	if (check_cpu_on_node(pgdat))
 		return;
 
 	/*
_

Patches currently in -mm which might be from mhocko@xxxxxxxx are

mm-memory_hotplug-check-zone_movable-in-has_unmovable_pages.patch
mm-print-more-information-about-mapping-in-__dump_page.patch
mm-lower-the-printk-loglevel-for-__dump_page-messages.patch
mm-memory_hotplug-drop-pointless-block-alignment-checks-from-__offline_pages.patch
mm-memory_hotplug-print-reason-for-the-offlining-failure.patch
mm-memory_hotplug-be-more-verbose-for-memory-offline-failures.patch
mm-memory_hotplug-do-not-clear-numa_node-association-after-hot_remove.patch



