On Wed, 13 Nov 2019 11:45:59 +0800
王贇 <yun.wang@xxxxxxxxxxxxxxxxx> wrote:

> Add the description for 'cg_numa_stat', also a new doc to explain
> the details on how to deal with the per-cgroup numa statistics.
>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Michal Koutný <mkoutny@xxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Signed-off-by: Michael Wang <yun.wang@xxxxxxxxxxxxxxxxx>
> ---
>  Documentation/admin-guide/cg-numa-stat.rst      | 161 ++++++++++++++++++++++++
>  Documentation/admin-guide/kernel-parameters.txt |   4 +
>  Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
>  3 files changed, 174 insertions(+)
>  create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

Thanks for adding documentation for your new feature!  When you add a new
RST file, though, you should also add it to index.rst so that it becomes a
part of the docs build.

A couple of nits below...

> diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
> new file mode 100644
> index 000000000000..87b716c51e16
> --- /dev/null
> +++ b/Documentation/admin-guide/cg-numa-stat.rst
> @@ -0,0 +1,161 @@
> +===============================
> +Per-cgroup NUMA statistics
> +===============================
> +
> +Background
> +----------
> +
> +On NUMA platforms, remote memory access always carries a performance
> +penalty.  Although NUMA balancing works hard to maximize the proportion
> +of local accesses, there are still situations where it can't help.
> +
> +This can happen in modern production environments, where bunches of
> +cgroups are used to classify and control resources, introducing complex
> +configurations of memory policy, CPUs and NUMA nodes.  NUMA balancing
> +may then face a wrong memory policy or an exhausted local NUMA node,
> +leading to a low proportion of local page accesses.
> +
> +We need to notice such cases and figure out which workloads from which
> +cgroup introduced the issue; then we have a chance to make adjustments
> +and avoid the performance damage.
> +
> +However, there is no hardware counter for per-task local/remote access
> +information, so we do not know how many remote page accesses a
> +particular task has performed.
> +
> +Statistics
> +----------
> +
> +Fortunately, NUMA balancing scans each task's mappings and triggers page
> +faults periodically, which gives us the opportunity to record per-task
> +page access information.
> +
> +By doing "echo 1 > /proc/sys/kernel/cg_numa_stat" at run time, or by
> +adding the boot parameter 'cg_numa_stat', we enable the accounting of
> +per-cgroup NUMA statistics; the 'cpu.numa_stat' entry of the CPU cgroup
> +will then show the statistics:
> +
> +  locality -- execution time sectioned by task NUMA locality (in ms)
> +  exectime -- execution time sectioned by NUMA node (in ms)
> +
> +We define 'task NUMA locality' as:
> +
> +  nr_local_page_access * 100 / (nr_local_page_access + nr_remote_page_access)
> +
> +This per-task percentage value is updated on ticks for the current task,
> +and the access counters are updated on the task's NUMA balancing page
> +faults, so only the pages which NUMA balancing pays attention to are
> +accounted.
> +
> +On each tick, we acquire the locality of the current task on that CPU
> +and accumulate the tick into the counter of the corresponding locality
> +region; tasks from the same group share the counters, which become the
> +group locality.
> +
> +Similarly, we acquire the NUMA node of the CPU on which the current task
> +is executing and accumulate the tick into the counter of the
> +corresponding node, which becomes the per-cgroup node execution time.
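
As an aside for readers, the locality arithmetic and the per-tick
accounting above are easy to sketch in a few lines of Python.  The snippet
below is purely illustrative: the names mirror the formula in the text
rather than actual kernel symbols, and the kernel accumulates ticks into a
handful of locality regions rather than per-percentage counters.

    # Illustrative sketch only -- this is not kernel code.  Per-task NUMA
    # locality is the share of NUMA-balancing page faults that hit the
    # local node, expressed as a percentage.
    def task_numa_locality(nr_local_page_access, nr_remote_page_access):
        total = nr_local_page_access + nr_remote_page_access
        if total == 0:
            return None                # no NUMA-balancing faults sampled yet
        return nr_local_page_access * 100 // total

    # Rough model of the per-tick accounting described above: every tick
    # charges one tick's worth of time to counters shared by all tasks in
    # the cgroup, keyed here by raw locality value and by NUMA node.
    group_locality_ms = {}             # locality percentage -> accumulated ms
    group_exectime_ms = {}             # NUMA node id        -> accumulated ms

    def on_tick(nr_local, nr_remote, cpu_node, tick_ms=1):
        loc = task_numa_locality(nr_local, nr_remote)
        if loc is not None:
            group_locality_ms[loc] = group_locality_ms.get(loc, 0) + tick_ms
        group_exectime_ms[cpu_node] = group_exectime_ms.get(cpu_node, 0) + tick_ms

    on_tick(900, 100, cpu_node=1)      # a 90%-locality task running on node 1
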
> +
> +Note that the accounting is hierarchical, which means the NUMA
> +statistics represent not only the workload of this group, but also the
> +workloads of all of its descendants.
> +
> +For example the 'cpu.numa_stat' show:
> +  locality 39541 60962 36842 72519 118605 721778 946553
> +  exectime 1220127 1458684

You almost certainly want that rendered as a literal block, so say "show::".
There are other places where you'll want to do that as well.

> +The locality is sectioned into 7 regions, roughly as:
> +  0-13% 14-27% 28-42% 43-56% 57-71% 72-85% 86-100%
> +
> +And exectime is sectioned into 2 nodes, 0 and 1 in this case.
> +
> +Thus we know that the workloads of this group and its descendants have
> +executed for a total of 1220127ms on node_0 and 1458684ms on node_1;
> +tasks with locality around 0-13% executed for 39541ms, and tasks with
> +locality around 86-100% executed for 946553ms, which implies that most
> +of the memory accesses are local.
> +
> +Monitoring
> +-----------------

A slightly long underline :)

I'll stop here; thanks again for adding documentation.

jon