[ +cc for linux-mm mailinglist address ]

On 09/24/2015 05:08 PM, Joe Lawrence wrote:
> Hi Mel, Rik et al,
>
> We've encountered interesting NUMA balancing behavior on RHEL7.1,
> reproduced with an upstream 4.2 kernel (with a similar .config), that can
> leave a user process trapped in the kernel performing task_numa_work.
>
> Our test group set up a server with 256GB memory running a program that
> allocates and dirties ~50% of that memory. They reported the following
> condition when they attempted to kill the test process -- the signal was
> never handled; instead, traces showed the task stuck here:
>
> PID: 36205  TASK: ffff887a692a8b60  CPU: 23  COMMAND: "memory_test_64"
>  [exception RIP: change_protection_range+0x4f8]
>     RIP: ffffffff8118a878  RSP: ffff887a777c3d68  RFLAGS: 00000282
>     RAX: ffff887f24065141  RBX: ffff887a38e8ee58  RCX: 00000000d37a9780
>     RDX: 80000034dea5e906  RSI: 00000006e9bcb000  RDI: 80000034dea5e906
>     RBP: ffff887a777c3e60   R8: ffff887ae51f6948   R9: 0000000000000001
>     R10: 0000000000000000  R11: ffff887f26f6d428  R12: 0000000000000000
>     R13: 00000006e9c00000  R14: 8000000000000025  R15: 00000006e9bcb000
>     CS: 0010  SS: 0018
>  #0 [ffff887a777c3e68] change_protection at ffffffff8118abf5
>  #1 [ffff887a777c3ea0] change_prot_numa at ffffffff811a106b
>  #2 [ffff887a777c3eb0] task_numa_work at ffffffff810adb23
>  #3 [ffff887a777c3f00] task_work_run at ffffffff81093b37
>  #4 [ffff887a777c3f30] do_notify_resume at ffffffff81013b0c
>  #5 [ffff887a777c3f50] retint_signal at ffffffff8160bafc
>     RIP: 00000000004025c4  RSP: 00007fff80aa5cf0  RFLAGS: 00000206
>     RAX: 0000000ddaa64a60  RBX: 00000000000fdd00  RCX: 00000000000afbc0
>     RDX: 00007fe45b6b6010  RSI: 000000000002dd50  RDI: 0000000ddaa36d10
>     RBP: 00007fff80aa5d40   R8: 0000000000000000   R9: 000000000000007d
>     R10: 00007fff80aa5a70  R11: 0000000000000246  R12: 00007fe45b6b6010
>     R13: 00007fff80aa5f30  R14: 0000000000000000  R15: 0000000000000000
>     ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
>
> A quick sanity check of the kernel .config and sysctl values:
>
>   % grep NUMA .config
>   CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
>   CONFIG_NUMA_BALANCING=y
>   CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
>   # CONFIG_X86_NUMACHIP is not set
>   CONFIG_NUMA=y
>   CONFIG_AMD_NUMA=y
>   CONFIG_X86_64_ACPI_NUMA=y
>   # CONFIG_NUMA_EMU is not set
>   CONFIG_USE_PERCPU_NUMA_NODE_ID=y
>   CONFIG_ACPI_NUMA=y
>
>   % sysctl -a | grep numa_balancing
>   kernel.numa_balancing = 1
>   kernel.numa_balancing_scan_delay_ms = 1000
>   kernel.numa_balancing_scan_period_max_ms = 60000
>   kernel.numa_balancing_scan_period_min_ms = 1000
>   kernel.numa_balancing_scan_size_mb = 256
>
> A systemtap probe confirmed the task was indeed stuck in task_work_run,
> with new task_work_add occurring before the previous task_numa_work had a
> chance to return (prefix is current jiffies):
>
>   4555133534: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
>   4555133676: kernel.function("task_work_add@kernel/task_work.c:8")
>   4555134412: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
>   4555134412: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
>   4555134554: kernel.function("task_work_add@kernel/task_work.c:8")
>   4555135291: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
>   4555135291: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
>   4555135433: kernel.function("task_work_add@kernel/task_work.c:8")
>   4555136167: kernel.function("task_numa_work@kernel/sched/fair.c:1796").return
>   4555136167: kernel.function("task_numa_work@kernel/sched/fair.c:1796")
>   4555136309: kernel.function("task_work_add@kernel/task_work.c:8")
>
> Looking at the implementation of task_work_run, it will continue to
> churn as long as task->task_works keeps feeding it.
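[ For readers not familiar with the task_work machinery, the churn follows
directly from the shape of the drain loop in kernel/task_work.c. Very
roughly sketched -- not the verbatim source; the real code detaches the
list with cmpxchg(), reverses it to run callbacks in FIFO order, and
synchronizes with task_work_cancel()/exit: ]

void task_work_run(void)
{
        struct callback_head *work, *next;

        for (;;) {
                /* atomically detach everything queued so far */
                work = xchg(&current->task_works, NULL);
                if (!work)
                        break;  /* nothing pending -> back toward userspace */

                while (work) {
                        next = work->next;
                        work->func(work);       /* e.g. task_numa_work() */
                        work = next;
                }
                /*
                 * If more work was queued meanwhile -- here, the scheduler
                 * tick re-arming p->numa_work via task_tick_numa() ->
                 * task_work_add() -- go around again.  The task never gets
                 * back to userspace (and never reaches signal delivery in
                 * do_notify_resume) while the list keeps refilling.
                 */
        }
}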
kernel.function("task_work_add@kernel/task_work.c:8") > > Looking at the implementation of task_work_run, it will continue to > churn as long as task->task_works will feed it. > > I did further systemtap investigation to watch the program find its way > into this condition. What I found was that the numa_scan_period_max was > dropping < 200. This was an effect of a ballooning MM_ANONPAGES value > (by way of task_nr_scan_windows() and task_scan_max()): > > [ ... shortly after program start ... ] > > numa_scan_period_max = 1621 task_nr_scan_windows = 39 MM_ANONPAGES = 2548181 > numa_scan_period_max = 1538 task_nr_scan_windows = 40 MM_ANONPAGES = 2574349 > numa_scan_period_max = 1538 task_nr_scan_windows = 40 MM_ANONPAGES = 2599734 > > [ ... snip about 20 minutes of data... ] > > numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32956990 > numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32958955 > numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32960104 > numa_scan_period_max = 119 task_nr_scan_windows = 503 MM_ANONPAGES = 32960104 > > update_task_scan_period will assign the numa_scan_period to the minimum > of numa_scan_period_max and numa_scan_period * 2. As > numa_scan_period_max decreases, it will be the smaller value and hence > the numa_next_scan's get closer and closer. > > Looking back through the changelog, commit 598f0ec0bc99 "sched/numa: > Set the scan rate proportional to the memory usage of the task being > scanned" changed numa_balancing_scan_period_max semantics to tune the > length of time to complete a full scan. This may introduce the > possibility of falling into this condition, though I'm not 100% sure. > > Let me know if there is any additional data that would be helpful to > report. In the meantime, I've run a few hours with the following > workaround to hold off the task_numa_work grind. > > Regards, > > -- Joe > > -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 3d6baa7d4534..df34df492949 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -467,6 +467,11 @@ struct mm_struct { > */ > unsigned long numa_next_scan; > > + /* numa_last_work_time is the jiffy runtime of the previous > + * task_numa_work invocation, providing hysteresis for numa_next_scan > + * so it will be at least this many jiffies in the future. */ > + unsigned long numa_last_work_time; > + > /* Restart point for scanning and setting pte_numa */ > unsigned long numa_scan_offset; > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 6e2e3483b1ec..16a96297e2a3 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1651,7 +1651,7 @@ static void update_task_scan_period(struct task_struct *p, > p->numa_scan_period = min(p->numa_scan_period_max, > p->numa_scan_period << 1); > > - p->mm->numa_next_scan = jiffies + > + p->mm->numa_next_scan = jiffies + p->mm->numa_last_work_time + > msecs_to_jiffies(p->numa_scan_period); > > return; > @@ -2269,6 +2269,9 @@ out: > mm->numa_scan_offset = start; > else > reset_ptenuma_scan(p); > + > + mm->numa_last_work_time = jiffies - now; > + > up_read(&mm->mmap_sem); > } > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>