On Mon, 24 Feb 2020 13:30:44 +0100 SeongJae Park <sjpark@xxxxxxxxxx> wrote:

> From: SeongJae Park <sjpark@xxxxxxxxx>
>
> This commit adds a simple document for DAMON under
> `Documentation/admin-guide/mm`.
>

Nice document to get people started.  Certainly worked for me doing some
initial playing around.

In general this is an interesting piece of work.  I can see there are numerous
possible avenues to explore in making the monitoring more flexible, or
potentially better at tracking usage whilst not breaking your fundamental
'bounded overhead' requirement.  Will be fun perhaps to explore some of those.

I'll do some more exploring and perhaps try some real world workloads.

Thanks,

Jonathan

> Signed-off-by: SeongJae Park <sjpark@xxxxxxxxx>
> ---
>  .../admin-guide/mm/data_access_monitor.rst | 414 ++++++++++++++++++
>  Documentation/admin-guide/mm/index.rst     |   1 +
>  2 files changed, 415 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/data_access_monitor.rst
>
> diff --git a/Documentation/admin-guide/mm/data_access_monitor.rst b/Documentation/admin-guide/mm/data_access_monitor.rst
> new file mode 100644
> index 000000000000..4d836c3866e2
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/data_access_monitor.rst
> @@ -0,0 +1,414 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========================
> +DAMON: Data Access MONitor
> +==========================
> +
> +Introduction
> +============
> +
> +Memory management decisions can normally be more efficient if finer data
> +access information is available.  However, because finer information usually
> +comes with higher overhead, most systems including Linux made a tradeoff:
> +Forgive some wise decisions and use coarse information and/or light-weight
> +heuristics.

I'm not sure what "Forgive some wise decisions" means...

> +
> +A number of experimental data access pattern aware memory management
> +optimizations say the sacrifices are huge (2.55x slowdown).

Good to have a reference.

> +However, none of those has successfully been adopted into the Linux kernel,
> +mainly due to the absence of a scalable and efficient data access monitoring
> +mechanism.
> +
> +DAMON is a data access monitoring solution for the problem.  It is 1)
> +accurate enough for DRAM level memory management, 2) light-weight enough to
> +be applied online, and 3) keeps a predefined upper-bound of overhead
> +regardless of the size of the target workloads (thus scalable).
> +
> +DAMON is implemented as a standalone kernel module and provides several
> +simple interfaces.  Owing to that, though it has mainly been designed for the
> +kernel's memory management mechanisms, it can also be used by a wide range of
> +user space programs and people.
> +
> +
> +Frequently Asked Questions
> +==========================
> +
> +Q: Why not integrated with perf?
> +A: From the perspective of perf-like profilers, DAMON can be thought of as a
> +data source in the kernel, like tracepoints, pressure stall information
> +(psi), or idle page tracking.  Thus, it can be easily integrated with those.
> +However, this patchset doesn't provide a fancy perf integration because the
> +current stage of DAMON development is focused on its core logic only.  That
> +said, DAMON already provides two interfaces for user space programs, which
> +are based on debugfs and tracepoints, respectively.  Using the tracepoint
> +interface, you can use DAMON with perf.  This patchset also provides a
> +debugfs interface based user space tool for DAMON.
> +It can be used to record, visualize, and analyze the data access patterns of
> +target processes in a convenient way.
> +
> +Q: Why a new module, instead of extending perf or other tools?
> +A: First, DAMON aims to be used by other programs including the kernel.
> +Therefore, having a dependency on specific tools like perf is not desirable.
> +Second, because it needs to be as lightweight as possible so that it can be
> +used online, any unnecessary overhead such as kernel - user space context
> +switching cost should be avoided.  These are the two biggest reasons why
> +DAMON is implemented in the kernel space.  The idle page tracking subsystem
> +would be the kernel feature that seems most similar to DAMON.  However, its
> +interface is not compatible with DAMON's.  Also, its internal implementation
> +has no common parts that could be reused by DAMON.
> +
> +Q: Can 'perf mem' provide the data required for DAMON?
> +A: On the systems supporting 'perf mem', yes.  DAMON uses the PTE Accessed
> +bits at a low level.  Other H/W or S/W features that can serve the purpose
> +could also be used.  However, as explained in the answer to the above
> +question, DAMON needs to be implemented in the kernel space.
> +
> +
> +Expected Use-cases
> +==================
> +
> +A straightforward use case of DAMON would be program behavior analysis.
> +With the DAMON output, users can confirm whether the program is running as
> +intended or not.  This will be useful for debugging and testing of design
> +points.
> +
> +The monitored results can also be useful for measuring the dynamic working
> +set size of workloads.  This will be useful for the administration of memory
> +overcommitted systems, or for selecting the environments (e.g., containers
> +providing different amounts of memory) for your workloads.
> +
> +If you are a programmer, you can optimize your program by managing the memory
> +based on the actual data access pattern.  For example, you can identify the
> +dynamic hotness of your data using DAMON and call ``mlock()`` to keep your
> +hot data in DRAM, or call ``madvise()`` with ``MADV_PAGEOUT`` to proactively
> +reclaim cold data.  Even if your program is guaranteed to not encounter
> +memory pressure, you can still improve performance by using the DAMON output
> +to guide calls of ``MADV_HUGEPAGE`` and ``MADV_NOHUGEPAGE``.  More creative
> +optimizations would be possible.  Our evaluation of DAMON includes a
> +straightforward optimization using ``mlock()``.  Please refer to the below
> +Evaluation section for more detail.
> +
> +As DAMON incurs very low overhead, such optimizations can be applied not only
> +offline, but also online.  Also, there is no reason to limit such
> +optimizations to the user space.  Several parts of the kernel's memory
> +management mechanisms could also be optimized using DAMON.  Reclamation, THP
> +(de)promotion decisions, and compaction would be such candidates.
> +
> +
> +Mechanisms of DAMON
> +===================
> +
> +
> +Basic Access Check
> +------------------
> +
> +DAMON basically reports which pages are accessed and how frequently.  The
> +report is passed to users in binary format via a ``result file`` whose path
> +users can set.  Note that the frequency is not an absolute number of
> +accesses, but a relative frequency among the pages of the target workloads.
> +
> +Users can also control the resolution of the reports by setting two time
> +intervals, ``sampling interval`` and ``aggregation interval``.  In detail,
> +DAMON checks access to each page per ``sampling interval``, aggregates the
> +results (counts the number of the accesses to each page), and reports the
> +aggregated results per ``aggregation interval``.  For the access check of
> +each page, DAMON uses the Accessed bits of PTEs.
> +
> +This is thus similar to the previously mentioned periodic access check based
> +mechanisms, whose overhead increases as the size of the target process grows.
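
Just to check I've followed the two intervals correctly: in user space terms
the per-page scheme reads to me as roughly the toy model below (plain Python,
nothing to do with the actual kernel code; the page set and the 'accessed'
outcome are faked here, whereas the real thing reads and clears PTE Accessed
bits).

import random

SAMPLING_INTERVAL = 5        # arbitrary units, think 5 ms
AGGREGATION_INTERVAL = 100   # arbitrary units, think 100 ms

pages = list(range(16))              # stand-in for the monitored pages
nr_accesses = {p: 0 for p in pages}  # per-page counters

for t in range(0, 3 * AGGREGATION_INTERVAL, SAMPLING_INTERVAL):
    # "clear the Accessed bit, wait one sampling interval, read it back"
    for page in pages:
        if random.random() < 0.3:    # fake workload touching the page
            nr_accesses[page] += 1
    if (t + SAMPLING_INTERVAL) % AGGREGATION_INTERVAL == 0:
        print("aggregated result:", nr_accesses)   # the "report"
        nr_accesses = {p: 0 for p in pages}        # cleared after reporting

I see the "reports and clears" wording further down, so I assume the counters
reset at each aggregation boundary, which is what the toy does.
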
> +
> +
> +Region Based Sampling
> +---------------------
> +
> +To avoid the unbounded increase of the overhead, DAMON groups a number of
> +adjacent pages that are assumed to have the same access frequency into a
> +region.  As long as the assumption (pages in a region have the same access
> +frequency) is kept, only one page in the region needs to be checked.  Thus,
> +for each ``sampling interval``, DAMON randomly picks one page in each region
> +and clears its Accessed bit.  After one more ``sampling interval``, DAMON
> +reads the Accessed bit of the page and increases the access frequency of the
> +region if the bit has been set in the meantime.  Therefore, the monitoring
> +overhead is controllable by setting the number of regions.  DAMON allows
> +users to set the minimum and maximum number of regions for the trade-off.
> +
> +Except for the assumption, this is almost the same as the above-mentioned
> +miniature-like static region based sampling.  In other words, this scheme
> +cannot preserve the quality of the output if the assumption is not
> +guaranteed.
> +
> +
> +Adaptive Regions Adjustment
> +---------------------------
> +
> +At the beginning of the monitoring, DAMON constructs the initial regions by
> +evenly splitting the memory mapped address space of the process into the
> +user-specified minimum number of regions.  In this initial state, the
> +assumption is normally not kept and thus the quality could be low.  To keep
> +the assumption as much as possible, DAMON adaptively merges and splits each
> +region.  For each ``aggregation interval``, it compares the access
> +frequencies of adjacent regions and merges those if the frequency difference
> +is small.  Then, after it reports and clears the aggregated access frequency
> +of each region, it splits each region into two regions if the total number of
> +regions is smaller than half of the user-specified maximum number of regions.
> +
> +In this way, DAMON provides its best-effort quality and minimal overhead
> +while keeping the bounds users set for their trade-off.
> +
> +
> +Applying Dynamic Memory Mappings
> +--------------------------------
> +
> +Only a small number of parts of the huge virtual address space of a process
> +are mapped to physical memory and accessed.  Thus, tracking the unmapped
> +address regions is just wasteful.  However, tracking every memory mapping
> +change might incur an overhead.  For this reason, DAMON applies the dynamic
> +memory mapping changes to the tracked regions only once per user-specified
> +time interval (``regions update interval``).

One key part of the approach is the 3 region bit. Perhaps talk about
that here somewhere?
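
Also, my reading of the merge / split step described under "Adaptive Regions
Adjustment" is roughly the sketch below (Python, purely illustrative; the
merge threshold and the region numbers are invented, and I'm ignoring the
mapping updates and the minimum-regions handling).

MAX_NR_REGIONS = 1000
MERGE_THRESHOLD = 2   # max access frequency difference allowed for a merge

def merge_regions(regions):
    # regions: sorted list of (start, end, nr_accesses) tuples
    merged = [regions[0]]
    for start, end, nr in regions[1:]:
        last_start, last_end, last_nr = merged[-1]
        if last_end == start and abs(nr - last_nr) <= MERGE_THRESHOLD:
            merged[-1] = (last_start, end, max(last_nr, nr))
        else:
            merged.append((start, end, nr))
    return merged

def split_regions(regions):
    # after reporting, the counters are cleared and each region is split
    # in two, but only if the total is below half of the maximum
    if len(regions) >= MAX_NR_REGIONS // 2:
        return regions
    out = []
    for start, end, _ in regions:
        mid = start + (end - start) // 2
        out += [(start, mid, 0), (mid, end, 0)]
    return out

regions = [(0, 4096, 3), (4096, 8192, 4), (8192, 12288, 0)]
print(split_regions(merge_regions(regions)))

If I've got that shape right, a short worked example along these lines in the
document would help readers a lot.
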
> +
> +
> +``debugfs`` Interface
> +=====================
> +
> +DAMON exports four files, ``attrs``, ``pids``, ``record``, and ``monitor_on``
> +under its debugfs directory, ``<debugfs>/damon/``.
> +
> +Attributes
> +----------
> +
> +Users can read and write the ``sampling interval``, ``aggregation interval``,
> +``regions update interval``, and the min/max number of monitoring target
> +regions by reading from and writing to the ``attrs`` file.  For example, the
> +commands below set those values to 5 ms, 100 ms, 1,000 ms, 10, and 1,000, and
> +then check them again::
> +
> +    # cd <debugfs>/damon
> +    # echo 5000 100000 1000000 10 1000 > attrs

I'm personally a great fan of human readable interfaces.  Could we just
split this into one file per interval?  That way the file naming would
make it self describing.

> +    # cat attrs
> +    5000 100000 1000000 10 1000
> +
> +Target PIDs
> +-----------
> +
> +Users can read and write the pids of the current monitoring target processes
> +by reading from and writing to the ``pids`` file.  For example, the commands
> +below set processes having pids 42 and 4242 as the processes to be monitored
> +and check them again::
> +
> +    # cd <debugfs>/damon
> +    # echo 42 4242 > pids
> +    # cat pids
> +    42 4242
> +
> +Note that setting the pids doesn't start the monitoring.
> +
> +Record
> +------
> +
> +DAMON supports a direct monitoring result recording feature.  The recorded
> +results are first written to a buffer and flushed to a file in batches.
> +Users can set the size of the buffer and the path to the result file by
> +reading from and writing to the ``record`` file.  For example, the commands
> +below set the buffer to be 4 KiB and the result to be saved in
> +``/damon.data``::
> +
> +    # cd <debugfs>/damon
> +    # echo "4096 /damon.data" > pids

write it to record, not pids.

> +    # cat record
> +    4096 /damon.data
> +
> +Turning On/Off
> +--------------
> +
> +You can check the current status, and start or stop the monitoring, by
> +reading from and writing to the ``monitor_on`` file.  Writing ``on`` to the
> +file starts DAMON monitoring the target processes with the given attributes.
> +Writing ``off`` to the file stops DAMON.  DAMON also stops if every target
> +process is terminated.  The example commands below turn DAMON on, turn it
> +off, and check its status::
> +
> +    # cd <debugfs>/damon
> +    # echo on > monitor_on
> +    # echo off > monitor_on
> +    # cat monitor_on
> +    off
> +
> +Please note that you cannot write to the ``attrs`` and ``pids`` files while
> +the monitoring is turned on.  If you write to the files while DAMON is
> +running, ``-EINVAL`` will be returned.

Perhaps -EBUSY would be more informative?  Implies values might be fine,
but the issue is 'not now'.
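
By the way, the debugfs interface was easy enough to drive from a script
while playing around.  Something along these lines worked for me (assuming
debugfs is mounted at /sys/kernel/debug, a target pid of 4242, and the
attribute / record values from the examples above; needs root, and the exact
values are just placeholders):

import time

DAMON_DIR = "/sys/kernel/debug/damon"

def damon_write(name, value):
    # each file takes a short text string, as described above
    with open(f"{DAMON_DIR}/{name}", "w") as f:
        f.write(value)

damon_write("attrs", "5000 100000 1000000 10 1000")  # intervals + region bounds
damon_write("pids", "4242")                          # process to monitor
damon_write("record", "4096 /damon.data")            # buffer size + result path
damon_write("monitor_on", "on")
time.sleep(5)
damon_write("monitor_on", "off")

Nothing deep, but it suggests the interface is perfectly scriptable as is,
whichever way the attrs-file discussion above goes.
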
> +
> +
> +User Space Tool for DAMON
> +=========================
> +
> +There is a user space tool for DAMON, ``/tools/damon/damo``.  It provides
> +another user interface which is more convenient than the debugfs interface.
> +Nevertheless, note that it is only aimed to be a minimal reference for
> +DAMON's debugfs interface and a means of testing DAMON itself.  Based on the
> +debugfs interface, you can create other, more convenient user space tools.
> +
> +The interface of the tool is basically subcommand based.  You can almost
> +always use the ``-h`` option to get help on the use of each subcommand.
> +Currently, it supports two subcommands, ``record`` and ``report``.
> +
> +
> +Recording Data Access Pattern
> +-----------------------------
> +
> +The ``record`` subcommand records the data access pattern of a target process
> +in a file (``./damon.data`` by default) using DAMON.  You can specify the
> +target as either a pid or a command to execute.  The example below shows a
> +command target usage::
> +
> +    # cd <kernel>/tools/damon/
> +    # ./damo record "sleep 5"
> +
> +The tool will execute ``sleep 5`` by itself and record the data access
> +patterns of the process.  The example below shows a pid target usage::
> +
> +    # sleep 5 &
> +    # ./damo record `pidof sleep`
> +
> +You can set more detailed attributes and the path to the recorded data file
> +using optional arguments to the subcommand.  Use the ``-h`` option for more
> +help.
> +
> +
> +Analyzing Data Access Pattern
> +-----------------------------
> +
> +The ``report`` subcommand reads a data access pattern record file (if not
> +explicitly specified, it reads the ``./damon.data`` file if it exists) and
> +generates reports of various types.  You can specify what type of report you
> +want using a sub-subcommand of the ``report`` subcommand.  For supported
> +types, pass the ``-h`` option to the ``report`` subcommand.
> +
> +
> +raw
> +~~~
> +
> +The ``raw`` sub-subcommand simply transforms the record, which stores the
> +data access patterns in a binary format, into human readable text.  For
> +example::
> +
> +    $ ./damo report raw
> +    start_time: 193485829398
> +    rel time: 0
> +    nr_tasks: 1
> +    pid: 1348
> +    nr_regions: 4
> +    560189609000-56018abce000( 22827008): 0
> +    7fbdff59a000-7fbdffaf1a00( 5601792): 0
> +    7fbdffaf1a00-7fbdffbb5000( 800256): 1
> +    7ffea0dc0000-7ffea0dfd000( 249856): 0
> +
> +    rel time: 100000731
> +    nr_tasks: 1
> +    pid: 1348
> +    nr_regions: 6
> +    560189609000-56018abce000( 22827008): 0
> +    7fbdff59a000-7fbdff8ce933( 3361075): 0
> +    7fbdff8ce933-7fbdffaf1a00( 2240717): 1
> +    7fbdffaf1a00-7fbdffb66d99( 480153): 0
> +    7fbdffb66d99-7fbdffbb5000( 320103): 1
> +    7ffea0dc0000-7ffea0dfd000( 249856): 0
> +
> +The first line shows the timestamp at which the recording started (in
> +nanoseconds).  Records of the data access patterns follow.  Each record is
> +separated by a blank line.  Each record first specifies the recorded time
> +(``rel time``) and the number of monitored tasks in this record
> +(``nr_tasks``).  Records of the data access pattern of each task follow.
> +Each data access pattern for a task shows first its pid (``pid``) and the
> +number of monitored virtual address regions in this access pattern
> +(``nr_regions``).  After that, each line shows the start/end address, size,
> +and number of monitored accesses for each of the regions.
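
The raw text output is pleasantly easy to post-process, too.  For instance, a
throwaway parser to sum the per-region access counts of each snapshot could
look something like the sketch below (field names and line layout taken from
the example output above; this is just something I knocked up, not part of
damo):

import re, sys

region_re = re.compile(r"([0-9a-f]+)-([0-9a-f]+)\(\s*(\d+)\):\s*(\d+)")

def report(rel_time, total):
    print(f"rel time {rel_time}: {total} region accesses in total")

rel_time, total = None, 0
for line in sys.stdin:
    line = line.strip()
    if line.startswith("rel time:"):
        rel_time, total = int(line.split(":")[1]), 0
    elif (m := region_re.match(line)):
        total += int(m.group(4))        # the per-region access count
    elif not line and rel_time is not None:
        report(rel_time, total)         # a blank line ends a record
        rel_time = None
if rel_time is not None:                # last record may lack a trailing blank
    report(rel_time, total)

Used as  ./damo report raw | python3 sum_accesses.py  (name made up) it gives
one line per aggregation snapshot, which was handy for eyeballing whether the
numbers move as expected.
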
> +
> +
> +heats
> +~~~~~
> +
> +The ``raw`` type shows detailed information, but it is exhausting to read and
> +analyze manually.  For this reason, ``heats`` plots the data in heatmap form,
> +using time as the x-axis, virtual address as the y-axis, and access frequency
> +as the z-axis.  Users can also set the resolution and the start/end point of
> +each axis via optional arguments.  For example::
> +
> +    $ ./damo report heats --tres 3 --ares 3
> +    0            0         0.0
> +    0            7609002   0.0
> +    0            15218004  0.0
> +    66112620851  0         0.0
> +    66112620851  7609002   0.0
> +    66112620851  15218004  0.0
> +    132225241702 0         0.0
> +    132225241702 7609002   0.0
> +    132225241702 15218004  0.0
> +
> +This command shows the recorded access pattern of the ``sleep`` command using
> +3 data points for each of the time and address axes.  Therefore, it shows 9
> +data points in total.
> +
> +Users can easily convert this text output into a heatmap image or another 3D
> +representation using various tools such as 'gnuplot'.  The ``heats``
> +sub-subcommand also provides 'gnuplot' based heatmap image creation.  For
> +this, you can use the ``--heatmap`` option.  Also, note that because it uses
> +'gnuplot' internally, it will fail if 'gnuplot' is not installed on your
> +system.  For example::
> +
> +    $ ./damo report heats --heatmap heatmap.png
> +
> +This creates a ``heatmap.png`` file containing the heatmap image.  It
> +supports ``pdf``, ``png``, ``jpeg``, and ``svg``.
> +
> +For proper zoom in / zoom out, you need to know the layout of the record.
> +For that, use the '--guide' option.  If the option is given, it will provide
> +useful information about the records in the record file.  For example::
> +
> +    $ ./damo report heats --guide
> +    pid:1348
> +    time: 193485829398-198337863555 (4852034157)
> +    region 0: 00000094564599762944-00000094564622589952 (22827008)
> +    region 1: 00000140454009610240-00000140454016012288 (6402048)
> +    region 2: 00000140731597193216-00000140731597443072 (249856)
> +
> +The output shows the monitored regions (start and end addresses in bytes) and
> +the monitored time duration (start and end times in nanoseconds) of each
> +target task.  Because the gaps between the regions are huge in this case, it
> +would be wise to plot each region separately rather than plotting the entire
> +address space in one heatmap.
> +
> +
> +wss
> +~~~
> +
> +The ``wss`` type shows the distribution or the chronological changes of the
> +working set size of the recorded workload, using the records.  For example::
> +
> +    $ ./damo report wss
> +    # <percentile> <wss>
> +    # pid 1348
> +    # avr: 66228
> +    0   0
> +    25  0
> +    50  0
> +    75  0
> +    100 1920615
> +
> +Without any option, it shows the distribution of the working set sizes as
> +above.  Basically, it shows the 0th, 25th, 50th, 75th, and 100th percentiles
> +and the average of the measured working set sizes in the access pattern
> +records.  In this case, the working set size was zero at the 75th percentile,
> +but 1,920,615 bytes at the maximum and 66,228 bytes on average.
> +
> +By setting the sort key of the percentile using '--sortby', you can also see
> +how the working set size changes chronologically.  For example::
> +
> +    $ ./damo report wss --sortby time
> +    # <percentile> <wss>
> +    # pid 1348
> +    # avr: 66228
> +    0   0
> +    25  0
> +    50  0
> +    75  0
> +    100 0
> +
> +The average is still 66,228.  And, because we sorted the working set sizes by
> +the recorded time and the accesses happened in a very short period, the
> +output cannot show when the accesses were made.
> +
> +Users can specify the resolution of the distribution (``--range``).  It also
> +supports 'gnuplot' based simple visualization (``--plot``) of the
> +distribution.
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index 11db46448354..d3d0ba373eb6 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -27,6 +27,7 @@ the Linux memory management.
> 
>     concepts
>     cma_debugfs
> +   data_access_monitor
>     hugetlbpage
>     idle_page_tracking
>     ksm