Adding more recipients for comments.  The original RFC mail is available at:
https://lore.kernel.org/linux-mm/20200110131522.29964-1-sjpark@xxxxxxxxxx/

Thanks,
SeongJae Park

On Fri, 10 Jan 2020 14:15:17 +0100 SeongJae Park <sjpark@xxxxxxxxxx> wrote:

> From: SeongJae Park <sjpark@xxxxxxxxx>
>
> This RFC patchset introduces a new kernel module for practical monitoring of
> data accesses, namely DAMON.
>
> The patches are organized in the following sequence.  The first and second
> patches introduce the core logic and the raw-level user interface of DAMON,
> respectively.  To provide a minimal reference to the raw-level interface and
> to allow more convenient testing of DAMON itself, the third patch implements
> user-space wrapper tools for DAMON.  The fourth patch adds a document for
> DAMON, and finally the fifth patch provides DAMON's unit tests, which use
> the kunit framework.
>
> The patches are based on v5.4 plus the back-ported kunit, which was
> retrieved from v5.5-rc1.  You can also clone the complete git tree by:
>
>     $ git clone git://github.com/sjp38/linux -b damon/rfc/v1
>
> It is also available on the web:
> https://github.com/sjp38/linux/releases/tag/damon/rfc/v1
>
> ----
>
> DAMON is a kernel module that allows users to monitor the actual memory
> access pattern of specific user-space processes.  It aims to be 1) accurate
> enough to be useful for performance-centric domains, and 2) sufficiently
> light-weight so that it can be applied online.
>
> For these goals, DAMON utilizes two core mechanisms, called region-based
> sampling and adaptive regions adjustment.  Region-based sampling allows
> users to make their own trade-off between the quality and the overhead of
> the monitoring and to set an upper bound on the monitoring overhead.
> Further, the adaptive regions adjustment mechanism lets DAMON maximize the
> quality and minimize the overhead on a best-effort basis while preserving
> the user-configured trade-off.
>
>
> Background
> ==========
>
> For performance-centric analysis and optimization of memory management
> schemes (either in kernel space or in user space), the actual data access
> pattern of the workloads is highly useful.  The information only needs to be
> reasonably accurate rather than strictly correct, because some level of
> inaccuracy can be tolerated in many performance-centric domains.  It also
> needs to be obtained within a reasonably short time and with only
> light-weight overhead.
>
> Manually extracting such data is difficult and time consuming if the target
> workload is huge and complex, even for the developers of the programs.
> There is a range of tools and techniques developed for general memory access
> investigation, and some of those could be partially used for this purpose.
> However, most of them are impractical or unscalable, mainly because they are
> designed without consideration of the trade-off between the accuracy of the
> output and the overhead.
>
> The memory access instrumentation techniques that are applied in many tools
> such as Intel PIN are essential for correctness-critical cases such as
> invalid memory access bug detection.  However, they usually incur high
> overhead that is unacceptable for many performance-centric domains.
> Periodic access checks based on H/W or S/W access counting features (e.g.,
> the Accessed bits of PTEs or the PG_idle flags of pages) can dramatically
> decrease the overhead by sacrificing some of the quality, compared to the
> instrumentation-based techniques.
> The reduced quality is still reasonable for many of the domains, but the
> overhead can increase arbitrarily as the size of the target workload grows.
> Miniature-like static region based sampling can set an upper bound on the
> overhead, but it in turn decreases the quality of the output as the size of
> the workload grows.
>
>
> Related Works
> =============
>
> There are a number of studies[1,2,3,4,5,6] optimizing memory management
> mechanisms based on the actual memory access patterns that show impressive
> results.  However, most of them give little consideration to the monitoring
> of the accesses itself.  Some of them focus on the overhead of the
> monitoring, but either do not consider the scalability of the accuracy[6] or
> have additional dependencies[7].  Indeed, one recent work[5] on proactive
> reclamation was also proposed[8] to the kernel community, but the monitoring
> overhead was considered a main problem.
>
> [1] Subramanya R Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan Sundaram,
>     Nadathur Satish, Rajesh Sankaran, Jeff Jackson, and Karsten Schwan.
>     2016.  Data tiering in heterogeneous memory systems.  In Proceedings of
>     the 11th European Conference on Computer Systems (EuroSys).  ACM, 15.
> [2] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J Rossbach, and
>     Emmett Witchel. 2016.  Coordinated and efficient huge page management
>     with Ingens.  In 12th USENIX Symposium on Operating Systems Design and
>     Implementation (OSDI).  705–721.
> [3] Harald Servat, Antonio J Peña, Germán Llort, Estanislao Mercadal,
>     Hans-Christian Hoppe, and Jesús Labarta. 2017.  Automating the
>     application data placement in hybrid memory systems.  In 2017 IEEE
>     International Conference on Cluster Computing (CLUSTER).  IEEE, 126–136.
> [4] Vlad Nitu, Boris Teabe, Alain Tchana, Canturk Isci, and Daniel Hagimont.
>     2018.  Welcome to zombieland: practical and energy-efficient memory
>     disaggregation in a datacenter.  In Proceedings of the 13th European
>     Conference on Computer Systems (EuroSys).  ACM, 16.
> [5] Andres Lagar-Cavilla, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal,
>     Radoslaw Burny, Shakeel Butt, Jichuan Chang, Ashwin Chaugule, Nan Deng,
>     Junaid Shahid, Greg Thelen, Kamil Adam Yurtsever, Yu Zhao, and
>     Parthasarathy Ranganathan. 2019.  Software-Defined Far Memory in
>     Warehouse-Scale Computers.  In Proceedings of the 24th International
>     Conference on Architectural Support for Programming Languages and
>     Operating Systems (ASPLOS).  ACM, New York, NY, USA, 317–330.
>     DOI:https://doi.org/10.1145/3297858.3304053
> [6] Carl Waldspurger, Trausti Saemundsson, Irfan Ahmad, and Nohhyun Park.
>     2017.  Cache Modeling and Optimization using Miniature Simulations.  In
>     2017 USENIX Annual Technical Conference (ATC).  USENIX Association,
>     Santa Clara, CA, 487–498.
>     https://www.usenix.org/conference/atc17/technical-sessions/
> [7] Haojie Wang, Jidong Zhai, Xiongchao Tang, Bowen Yu, Xiaosong Ma, and
>     Wenguang Chen. 2018.  Spindle: Informed Memory Access Monitoring.  In
>     2018 USENIX Annual Technical Conference (ATC).  USENIX Association,
>     Boston, MA, 561–574.
>     https://www.usenix.org/conference/atc18/presentation/wang-haojie
> [8] Jonathan Corbet. 2019.  Proactively reclaiming idle memory.
>     https://lwn.net/Articles/787611/.
>
>
> Expected Use-cases
> ==================
>
> A straightforward use case of DAMON would be program behavior analysis.
> With the DAMON output, users can confirm whether the program is running as
> intended or not.  This will be useful for debugging and for testing design
> points.
>
> The monitored results can also be useful for estimating the dynamic working
> set size of workloads.  This will be useful for the administration of
> memory-overcommitted systems or for selecting the environments (e.g.,
> containers providing different amounts of memory) for your workloads.
>
> If you are a programmer, you can optimize your program by managing the
> memory based on the actual data access pattern.  For example, you can
> identify the dynamic hotness of your data using DAMON and call ``mlock()``
> to keep your hot data in DRAM, or call ``madvise()`` with ``MADV_PAGEOUT``
> to proactively reclaim cold data.  Even if your program is guaranteed not to
> encounter memory pressure, you can still improve performance by applying the
> DAMON outputs to calls of ``madvise()`` with ``MADV_HUGEPAGE`` and
> ``MADV_NOHUGEPAGE``.  More creative optimizations would be possible.  Our
> evaluation of DAMON includes a straightforward optimization using
> ``mlock()``.  Please refer to the Evaluation section below for more detail.
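>
> As a small illustration of this kind of optimization (not part of this
> patchset), the sketch below mlock()-s a range that the programmer has
> identified as hot, e.g., from the DAMON output, and asks the kernel to
> proactively reclaim a range identified as cold.  The split of the buffer
> into hot and cold parts is a made-up placeholder; on pre-5.4 headers,
> ``MADV_PAGEOUT`` may need to be defined manually as below.
>
>     #define _GNU_SOURCE
>     #include <stdio.h>
>     #include <string.h>
>     #include <sys/mman.h>
>
>     #ifndef MADV_PAGEOUT
>     #define MADV_PAGEOUT 21    /* reclaim these pages (Linux 5.4+) */
>     #endif
>
>     int main(void)
>     {
>         size_t len = 64 << 20;    /* 64 MiB working buffer */
>         char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (buf == MAP_FAILED) {
>             perror("mmap");
>             return 1;
>         }
>         memset(buf, 1, len);    /* populate the pages */
>
>         /*
>          * Placeholder: assume monitoring showed that the first quarter
>          * of the buffer is hot and the last quarter is cold.
>          */
>         char *hot = buf;
>         size_t hot_len = len / 4;
>         char *cold = buf + 3 * (len / 4);
>         size_t cold_len = len / 4;
>
>         /* Keep the hot data resident in DRAM. */
>         if (mlock(hot, hot_len))
>             perror("mlock");
>
>         /* Proactively reclaim the cold data. */
>         if (madvise(cold, cold_len, MADV_PAGEOUT))
>             perror("madvise(MADV_PAGEOUT)");
>
>         /* ... the real workload would run here ... */
>
>         munlock(hot, hot_len);
>         munmap(buf, len);
>         return 0;
>     }
>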
> As DAMON incurs very low overhead, such optimizations can be applied not
> only offline, but also online.  Also, there is no reason to limit such
> optimizations to user space.  Several parts of the kernel's memory
> management mechanisms could also be optimized using DAMON.  Reclamation,
> THP (de)promotion decisions, and compaction would be such candidates.
> Nevertheless, the current version of DAMON is not highly optimized for
> online/in-kernel uses.
>
>
> Mechanisms of DAMON
> ===================
>
>
> Basic Access Check
> ------------------
>
> DAMON basically reports which pages are accessed how frequently.  The report
> is passed to users in a binary format via a ``result file``, the path of
> which users can set.  Note that the frequency is not an absolute number of
> accesses, but a relative frequency among the pages of the target workloads.
>
> Users can also control the resolution of the reports by setting two time
> intervals, the ``sampling interval`` and the ``aggregation interval``.  In
> detail, DAMON checks access to each page per ``sampling interval``,
> aggregates the results (counts the number of accesses to each page), and
> reports the aggregated results per ``aggregation interval``.  For the access
> check of each page, DAMON uses the Accessed bits of PTEs.
>
> This is thus similar to the previously mentioned periodic access check based
> mechanisms, whose overhead increases as the size of the target process
> grows.
>
>
> Region Based Sampling
> ---------------------
>
> To avoid the unbounded increase of the overhead, DAMON groups a number of
> adjacent pages that are assumed to have the same access frequency into a
> region.  As long as the assumption (pages in a region have the same access
> frequency) holds, only one page in the region needs to be checked.  Thus,
> for each ``sampling interval``, DAMON randomly picks one page in each region
> and clears its Accessed bit.  After one more ``sampling interval``, DAMON
> reads the Accessed bit of the page and increases the access frequency of the
> region if the bit has been set in the meantime.  Therefore, the monitoring
> overhead is controllable by setting the number of regions.  DAMON allows
> users to set the minimum and maximum number of regions for the trade-off.
>
> Except for the assumption, this is almost the same as the above-mentioned
> miniature-like static region based sampling.  In other words, this scheme
> cannot preserve the quality of the output if the assumption does not hold.
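>
> For reference, the sampling logic described above boils down to something
> like the following user-space sketch.  This is only an illustration, not the
> mm/damon.c code: the ``page_accessed()`` helper is a stand-in for clearing
> the Accessed bit of the PTE that maps the sampled page and testing it again
> one ``sampling interval`` later, which the kernel module does internally.
>
>     #include <stdbool.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     #define PAGE_SZ 4096UL
>
>     struct region {
>         unsigned long start, end;    /* [start, end) address range */
>         unsigned int nr_accesses;    /* aggregated access count */
>     };
>
>     /* Stand-in for the Accessed bit check; here it just fakes an answer. */
>     static bool page_accessed(unsigned long addr)
>     {
>         (void)addr;
>         return rand() & 1;
>     }
>
>     /* Pick a random page to represent the whole region. */
>     static unsigned long sample_page(struct region *r)
>     {
>         unsigned long nr_pages = (r->end - r->start) / PAGE_SZ;
>
>         return r->start + (rand() % nr_pages) * PAGE_SZ;
>     }
>
>     int main(void)
>     {
>         struct region regions[] = {
>             { 0x400000UL, 0x800000UL,  0 },
>             { 0x800000UL, 0x1000000UL, 0 },
>         };
>         int nr_regions = 2, samples_per_aggregation = 20;
>
>         /* One aggregation interval: many sampling intervals. */
>         for (int s = 0; s < samples_per_aggregation; s++)
>             for (int i = 0; i < nr_regions; i++)
>                 if (page_accessed(sample_page(&regions[i])))
>                     regions[i].nr_accesses++;
>
>         /* Report the relative access frequency of each region. */
>         for (int i = 0; i < nr_regions; i++)
>             printf("%#lx-%#lx: %u\n", regions[i].start,
>                    regions[i].end, regions[i].nr_accesses);
>         return 0;
>     }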
>
>
> Adaptive Regions Adjustment
> ---------------------------
>
> At the beginning of the monitoring, DAMON constructs the initial regions by
> evenly splitting the memory-mapped address space of the process into the
> user-specified minimum number of regions.  In this initial state, the
> assumption is normally not kept and thus the quality could be low.  To keep
> the assumption as much as possible, DAMON adaptively merges and splits each
> region.  For each ``aggregation interval``, it compares the access
> frequencies of adjacent regions and merges them if the frequency difference
> is small.  Then, after it reports and clears the aggregated access frequency
> of each region, it splits each region into two regions if the total number
> of regions is smaller than half of the user-specified maximum number of
> regions.
>
> In this way, DAMON provides best-effort quality and minimal overhead while
> keeping the bounds users set for their trade-off.
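>
> The following is a minimal user-space sketch of the merge and split steps
> described above, applied to a flat array of regions.  It only illustrates
> the idea; the data structure, the merge threshold, and the handling of the
> merged access frequency are simplifying assumptions, not the mm/damon.c
> implementation.
>
>     #include <stdio.h>
>
>     struct region {
>         unsigned long start, end;    /* [start, end) address range */
>         unsigned int nr_accesses;    /* aggregated access frequency */
>     };
>
>     #define MERGE_THRES 2       /* illustrative frequency difference */
>     #define MAX_REGIONS 1000    /* user-specified maximum */
>
>     /* Merge adjacent regions whose access frequencies are similar. */
>     static int merge_regions(struct region *r, int nr)
>     {
>         int w = 0;    /* index of the last merged region */
>
>         for (int i = 1; i < nr; i++) {
>             unsigned int a = r[w].nr_accesses, b = r[i].nr_accesses;
>
>             if ((a > b ? a - b : b - a) <= MERGE_THRES)
>                 r[w].end = r[i].end;    /* absorb neighbor; keep w's count */
>             else
>                 r[++w] = r[i];
>         }
>         return w + 1;    /* new number of regions */
>     }
>
>     /* Split every region into two halves if there is room to grow. */
>     static int split_regions(struct region *r, int nr)
>     {
>         if (nr >= MAX_REGIONS / 2)
>             return nr;
>
>         for (int i = nr - 1; i >= 0; i--) {
>             unsigned long mid = r[i].start + (r[i].end - r[i].start) / 2;
>
>             r[2 * i + 1] = r[i];
>             r[2 * i + 1].start = mid;
>             r[2 * i] = r[i];
>             r[2 * i].end = mid;
>         }
>         return nr * 2;
>     }
>
>     int main(void)
>     {
>         /* The array must have room for up to MAX_REGIONS entries. */
>         struct region r[MAX_REGIONS] = {
>             { 0x400000UL, 0x600000UL, 10 },
>             { 0x600000UL, 0x800000UL, 11 },    /* similar to the previous */
>             { 0x800000UL, 0xa00000UL, 3 },
>         };
>         int nr = 3;
>
>         nr = merge_regions(r, nr);    /* per aggregation interval ... */
>         nr = split_regions(r, nr);    /* ... after reporting and reset */
>         printf("now tracking %d regions\n", nr);
>         return 0;
>     }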
>
>
> Applying Dynamic Memory Mappings
> --------------------------------
>
> Only a small number of parts in the super-huge virtual address space of the
> processes are mapped to physical memory and accessed.  Thus, tracking the
> unmapped address regions is just wasteful.  However, tracking every memory
> mapping change might incur overhead.  For this reason, DAMON applies the
> dynamic memory mapping changes to the tracking regions only once in a
> user-specified time interval (the ``regions update interval``).
>
>
> Evaluations
> ===========
>
> A prototype of DAMON has been evaluated on an Intel Xeon E7-8837 machine
> using 20 benchmarks picked from the SPEC CPU 2006, NAS, TensorFlow
> Benchmark, SPLASH-2X, and PARSEC 3 benchmark suites.  Nonetheless, this
> section provides only a summary of the results.  For more detail, please
> refer to the slides used for the introduction of DAMON at the Linux Plumbers
> Conference 2019[1] or the MIDDLEWARE'19 industrial track paper[2].
>
>
> Quality
> -------
>
> We first traced and visualized the data access pattern of each workload.  We
> were able to confirm that the visualized results are reasonably accurate by
> manually comparing them with the source code of the workloads.
>
> To see the usefulness of the monitoring, we optimized 9 memory-intensive
> workloads among them for memory pressure situations using the DAMON outputs.
> In detail, we identified frequently accessed memory regions in each workload
> based on the DAMON results and protected them with ``mlock()`` system calls.
> The optimized versions consistently show speedups (2.55x in the best case,
> 1.65x on average) under memory pressure.
>
>
> Overhead
> --------
>
> We also measured the overhead of DAMON.  It was not only under the upper
> bound we set, but was much lower (0.6 percent of the bound in the best case,
> 13.288 percent of the bound on average).  This reduction of the overhead
> mainly results from the adaptive regions adjustment.  We also compared the
> overhead with that of straightforward periodic Accessed bit check based
> monitoring, which checks the access of every page frame.  DAMON's overhead
> was smaller than that of the straightforward mechanism by 94,242.42x in the
> best case and 3,159.61x on average.
>
>
> References
> ==========
>
> Prototypes of DAMON have been introduced by an LPC kernel summit track
> talk[1] and two academic papers[2,3].  Please refer to those for more
> detailed information, especially the evaluations.
>
> [1] SeongJae Park, Tracing Data Access Pattern with Bounded Overhead and
>     Best-effort Accuracy.  In The Linux Kernel Summit, September 2019.
>     https://linuxplumbersconf.org/event/4/contributions/548/
> [2] SeongJae Park, Yunjae Lee, Heon Y. Yeom, Profiling Dynamic Data Access
>     Patterns with Controlled Overhead and Quality.  In 20th ACM/IFIP
>     International Middleware Conference Industry, December 2019.
>     https://dl.acm.org/doi/10.1145/3366626.3368125
> [3] SeongJae Park, Yunjae Lee, Yunhee Kim, Heon Y. Yeom, Profiling Dynamic
>     Data Access Patterns with Bounded Overhead and Accuracy.  In IEEE
>     International Workshop on Foundations and Applications of Self* Systems
>     (FAS* 2019), June 2019.
>
>
> SeongJae Park (5):
>   mm: Introduce Data Access MONitor (DAMON)
>   mm/damon: Add debugfs interface
>   mm/damon: Add minimal user-space tools
>   Documentation/admin-guide/mm: Add a document for DAMON
>   mm/damon: Add kunit tests
>
>  .../admin-guide/mm/data_access_monitor.rst |  235 +++
>  Documentation/admin-guide/mm/index.rst     |    1 +
>  mm/Kconfig                                 |   23 +
>  mm/Makefile                                |    1 +
>  mm/damon-test.h                            |  571 ++++++++
>  mm/damon.c                                 | 1266 +++++++++++++++++
>  tools/damon/bin2txt.py                     |   64 +
>  tools/damon/damn                           |   36 +
>  tools/damon/heats.py                       |  358 +++++
>  tools/damon/nr_regions.py                  |  116 ++
>  tools/damon/record.py                      |  182 +++
>  tools/damon/report.py                      |   45 +
>  tools/damon/wss.py                         |  121 ++
>  13 files changed, 3019 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/data_access_monitor.rst
>  create mode 100644 mm/damon-test.h
>  create mode 100644 mm/damon.c
>  create mode 100644 tools/damon/bin2txt.py
>  create mode 100644 tools/damon/damn
>  create mode 100644 tools/damon/heats.py
>  create mode 100644 tools/damon/nr_regions.py
>  create mode 100644 tools/damon/record.py
>  create mode 100644 tools/damon/report.py
>  create mode 100644 tools/damon/wss.py
>
> --
> 2.17.1
>