A modern production environment may use hundreds of cgroups to control
the resources of different workloads, along with complicated resource
bindings.

On NUMA platforms with multiple nodes, things become even more
complicated. We want as many memory accesses as possible to be local to
improve performance, and NUMA Balancing works hard to achieve that.
However, a wrong memory policy or node binding can easily waste the
effort and result in a lot of remote page accesses.

We need to notice such problems so that we get a chance to fix them
before they do too much damage.

This patch set adds a document introducing the per-task and per-cgroup
NUMA locality info. With these statistics, we can monitor NUMA
efficiency daily and raise a warning when things go wrong.

Signed-off-by: Michael Wang <yun.wang@xxxxxxxxxxxxxxxxx>
Signed-off-by: Tianchen Ding <tianchen.dingtianc@xxxxxxxxxxxxxxx>
---
 Documentation/admin-guide/numa-locality.rst | 208 ++++++++++++++++++++
 1 file changed, 208 insertions(+)
 create mode 100644 Documentation/admin-guide/numa-locality.rst

diff --git a/Documentation/admin-guide/numa-locality.rst b/Documentation/admin-guide/numa-locality.rst
new file mode 100644
index 000000000000..1dee6bc70b7f
--- /dev/null
+++ b/Documentation/admin-guide/numa-locality.rst
@@ -0,0 +1,208 @@
+SPDX-License-Identifier: GPL-2.0
+
+=====================
+NUMA Locality measure
+=====================
+
+Background
+----------
+
+On NUMA platforms, remote memory access always has a performance penalty.
+Although we have NUMA balancing working hard to maximize the access locality,
+there are still situations where it can't help.
+
+This can happen in a modern production environment. When a large number of
+cgroups are used to classify and control resources, this creates a complex
+configuration for memory policy, CPUs and NUMA nodes.
+In such cases, NUMA balancing could end up with a wrong memory policy or an
+exhausted local node, leading to a low percentage of local page accesses.
+
+We need to detect such cases and figure out which workloads from which cgroup
+introduced the issues, so that we get a chance to make adjustments and avoid
+performance degradation.
+
+However, there are no hardware counters for per-task local/remote access info,
+so we don't know how many remote page accesses have occurred for a particular
+task.
+
+NUMA Locality
+-------------
+
+Fortunately, NUMA Balancing scans a task's mappings and triggers page faults
+periodically, giving us the opportunity to record per-task page access info.
+When the CPU taking the page fault is on the same node as the page, we consider
+the task to be doing a local page access, otherwise a remote page access. These
+page faults are recorded and available in /proc/vmstat for global information
+and /proc/PID/sched for a specific task.
+
+For global information, you can read 'numa_hint_faults' and
+'numa_hint_faults_local' in /proc/vmstat. They record how many NUMA hinting
+faults are trapped on all nodes and on the tasks' local nodes, respectively.
+The percentage of local faults can be calculated from them, which shows the
+global NUMA locality.
+
+Here are the meanings of parameters in /proc/vmstat related to NUMA balancing:
+
+====================== ============================================================
+numa_pte_updates       Number of pages that are marked as inaccessible.
+                       These are later cleared by a NUMA hinting fault.
+numa_huge_pte_updates  Similar to numa_pte_updates, but counts huge pages.
+numa_hint_faults       Number of NUMA hinting faults trapped on all nodes.
+numa_hint_faults_local Number of NUMA hinting faults trapped on local nodes.
+numa_pages_migrated    Number of misplaced pages successfully migrated to the
+                       specified destination.
+====================== ============================================================
+
+For a specific task, 'numa_faults' in /proc/PID/sched can be helpful. It lists
+the numbers of pages with NUMA hinting faults of the task itself and its NUMA
+group on each node after migration (if any). These numbers are exponentially
+decaying averages. For example, suppose a task is assigned to node0. If the
+numbers of 'task_private' and 'task_shared' on the other nodes are relatively
+small compared to the numbers on node0, this task is working with high locality.
+
+Here are the meanings of parameters in /proc/PID/sched related to NUMA balancing:
+
+=================== ==============================================================
+mm->numa_scan_seq   Number of scans performed by NUMA balancing.
+numa_pages_migrated Number of misplaced pages successfully migrated to the
+                    specified destination.
+numa_preferred_nid  The task's preferred NUMA node. NUMA balancing tries to
+                    migrate misplaced pages to this node first.
+total_numa_faults   Number of pages trapping NUMA hinting faults on all nodes.
+                    Equals the sum of 'task_private' and 'task_shared' on all
+                    nodes in 'numa_faults', so this number is also an
+                    exponentially decaying average.
+current_node        The NUMA node of the CPU on which this task is running.
+numa_group_id       Tasks running on the same node are automatically grouped.
+                    Usually the PID of the biggest task.
+numa_faults         A list of NUMA fault statistics for all nodes, recording the
+                    numbers of pages trapping NUMA hinting faults in private
+                    memory and shared memory of this task and its group on
+                    each node, respectively. These numbers are exponentially
+                    decaying averages.
+=================== ==============================================================
+
+To obtain the NUMA locality info of a cgroup, we can fetch all PIDs in the
+cgroup and aggregate their data.
+For example, to measure the locality of a cgroup consisting of two tasks (PIDs
+1001 and 1002), sum their local NUMA faults and their total NUMA faults from
+/proc/1001/sched and /proc/1002/sched.
+
+NUMA Consumption
+----------------
+
+There are also other cgroup entries which help us to estimate NUMA efficiency.
+They are 'cpuacct.usage_percpu' and 'memory.numa_stat'.
+
+By reading 'cpuacct.usage_percpu' we get the per-CPU runtime (in nanoseconds)
+info (in hierarchy) as:
+
+  CPU_0_RUNTIME CPU_1_RUNTIME CPU_2_RUNTIME ... CPU_X_RUNTIME
+
+Combined with the info from:
+
+  cat /sys/devices/system/node/nodeX/cpulist
+
+we can accumulate the runtime of CPUs into NUMA nodes to get the
+per-cgroup node runtime info.
+
+By reading 'memory.numa_stat' we get the per-cgroup node memory consumption
+info as:
+
+  total=TOTAL_MEM N0=MEM_ON_NODE0 N1=MEM_ON_NODE1 ... NX=MEM_ON_NODEX
+
+Together we call these the per-cgroup NUMA consumption info, which tells us how
+many resources a particular workload has consumed on a particular NUMA node.
+
+Monitoring
+----------
+
+By monitoring the changes in /proc/PID/sched, we can easily know whether NUMA
+Balancing is working well for a particular workload.
+
+We can take samples at a higher rate than the maximum scan rate of NUMA
+balancing, and check 'mm->numa_scan_seq'. If it has increased by one,
+the data has been updated, and we have:
+
+  local_diff = (task_private + task_shared) - (last_task_private + last_task_shared) / 2
+  total_diff = total_numa_faults - last_total_numa_faults / 2
+
+Since the data in /proc/PID/sched are exponentially decaying averages, we
+calculate the diffs in this way to get absolute page numbers. Here
+'task_private' and 'task_shared' refer to the values on the node of the task's
+CPU. If a task runs on multiple NUMA nodes, these values may not be accurate.
+
+The locality in this NUMA balancing scan period is then:
+
+  locality = local_diff * 100 / total_diff
+
+We can plot a line for locality. When the line is close to 100%, things are good;
+when it gets close to 0%, something is wrong. We can pick a proper watermark to
+trigger a warning message.
+
+You may want to drop the data if total_diff is too small, which implies that
+there are not many pages available for NUMA Balancing to scan. Ignoring it is
+fine, since most likely the workload is insensitive to NUMA or the memory
+topology is already good enough.
+
+Furthermore, eBPF can help record the stats. We can trace the 'task_numa_fault'
+function: when it is called, parameters such as the memory node and the number
+of pages can be collected and handled. This allows us to monitor a task's
+locality without polling /proc/PID/sched at a high rate.
+
+Monitoring the root group helps you keep track of the overall situation, while
+you may also want to monitor all the leaf groups that contain the workloads,
+which helps to pinpoint the culprit.
+
+Try to also put your workload into the cpuacct & memory cgroups. When NUMA
+Balancing is disabled or locality becomes too low, we may want to monitor
+the per-node runtime & memory info to see if the node consumption meets
+the requirements.
+
+For NUMA node X, on each sample we have:
+
+  runtime_X_diff = runtime_X - last_runtime_X
+  runtime_all_diff = runtime_all - last_runtime_all
+
+  runtime_percent_X = runtime_X_diff * 100 / runtime_all_diff
+  memory_percent_X = memory_X * 100 / memory_all
+
+These two percentages usually match on each node: a workload should execute
+mostly on the node that contains most of its memory, but it's not guaranteed.
+
+The workload may only access a small part of its memory. In such cases, although
+the majority of its memory is remote, locality could still be good.
+
+Thus, telling whether things are fine depends on understanding the system's
+resource deployment. However, if you find that node X has 100% memory percent
+but 0% runtime percent, something is definitely wrong.
+
+Troubleshooting
+---------------
+
+After identifying which workload introduced the bad locality, check:
+
+1. Is the workload bound to a particular NUMA node?
+2. Has any NUMA node run out of resources?
+
+There are several ways to bind a task's memory to a NUMA node. Strict ways
+like the MPOL_BIND memory policy or 'cpuset.mems' limit the memory nodes from
+which pages can be allocated. In this situation, the admin should make sure the
+task is allowed to run on the CPUs of that NUMA node, and that there are
+available CPU resources there.
+
+There are also ways to bind a task's CPUs to a NUMA node, like 'cpuset.cpus' or
+the sched_setaffinity() syscall. In this situation, NUMA Balancing helps migrate
+pages into that node, and the admin should make sure memory is available there.
+
+The admin can try to rebind or unbind the NUMA node to undo the damage:
+make a change, then observe the statistics to see if things get better,
+and repeat until the situation is acceptable.
+
+Highlights
+----------
+
+For some tasks, NUMA Balancing may find it unnecessary to scan pages,
+and locality could always be 0 or a small number. Don't pay attention to them,
+since they are most likely insensitive to NUMA.
+
-- 
2.25.1
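
The per-sample arithmetic from the Monitoring and NUMA Consumption sections of the document above can be sketched in Python as follows. This is an illustration only and not part of the patch. The function names (interval_faults, locality, parse_cpulist, node_runtime) and the min_total threshold are hypothetical. The caller is assumed to have already read the counters from /proc/PID/sched, 'cpuacct.usage_percpu' and the nodeX/cpulist files; parsing /proc/PID/sched is left out because the document does not show its exact layout.

```python
def interval_faults(current, previous):
    """Faults seen in the last scan period.

    The document states the counters are exponentially decaying averages and
    gives diff = current - previous / 2 to recover absolute page numbers.
    """
    return current - previous / 2


def locality(local_now, local_prev, total_now, total_prev, min_total=0):
    """Locality in percent for one scan period, or None if the sample is dropped.

    local_* is task_private + task_shared on the task's CPU node,
    total_* is total_numa_faults. min_total is a hypothetical cutoff for
    "total_diff is too small".
    """
    local_diff = interval_faults(local_now, local_prev)
    total_diff = interval_faults(total_now, total_prev)
    if total_diff <= min_total:
        return None  # too few scanned pages to say anything useful
    return local_diff * 100 / total_diff


def parse_cpulist(text):
    """Expand a cpulist string such as '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # a node without CPUs has an empty cpulist
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def node_runtime(usage_percpu, node_cpulists):
    """Accumulate per-CPU runtimes (nanoseconds) into per-node runtimes.

    usage_percpu is the content of 'cpuacct.usage_percpu'; node_cpulists maps
    a node id to the content of its nodeX/cpulist file.
    """
    runtimes = [int(value) for value in usage_percpu.split()]
    return {
        node: sum(runtimes[cpu] for cpu in parse_cpulist(cpulist))
        for node, cpulist in node_cpulists.items()
    }
```

Taking the difference of two node_runtime() results between samples gives runtime_X_diff for each node, from which runtime_percent_X follows as in the document.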