In addition to the ping, I'd like to share our recent real-world use case of
DAMON. I hope this real user story draws more comments than the previously
shared benchmark results did.

DAMON as a profiler
-------------------

Recently, we used DAMON to analyze the characteristics of large-scale
production systems utilizing 70GB DRAM and 36 CPUs. From this, we were able
to find interesting things, including the below.

The access patterns under the idle workload and the active workload were
obviously different. Under the idle workload, larger memory regions were
accessed with low frequency (resembling a scanning workload), while under the
active workload, smaller memory regions were accessed with high frequency.

DAMON found a 7GB memory region that shows obviously high access frequency
under the active workload. We believe this is the performance-effective
working set and needs to be protected.

There was a 4KB memory region that shows the highest access frequency under
not only the active but also the idle workload. I think this must be
something like a code section.

For this analysis, DAMON used only 0.3-1% of a single CPU's time. Because we
used recording-based analysis, it consumed about 3-12 MB of disk space per 20
minutes. This is only a small amount of disk space, but we can further reduce
the disk usage by using non-recording-based DAMON features.

I'd like to argue that only DAMON can do such detailed analysis (finding the
hottest 4KB region in 70GB of memory) with such light overhead.

DAMON as a system optimization tool
-----------------------------------

We also found the below potential performance problems on the systems and made
DAMON-based solutions.

The system doesn't want to make the workload suffer from page reclamation,
and thus it utilizes enough DRAM but no swap device. However, we found the
system is actively reclaiming file-backed pages, because the system has
intensive file IO.
The file IO turned out not to be performance critical for the workload, but
we want to ensure that performance-critical file-backed pages, such as the
code section, are not mistakenly evicted.

The straightforward solution would be using direct IO, but modifying the
workload is not so easy. We also considered `mlockall()`, but we couldn't use
it because the VSZ of the workload is much larger than the physical DRAM.
Finding the region and calling `mlock()` might be the right solution, but
modifying the system is still not easy. We found that the DAMON-based
operation scheme[1] could be used. With it, we can ask DAMON to track the
access frequency of each region and make a 'process_madvise(MADV_WILLNEED)'[2]
call for regions that have had a specific size and access frequency for a
time interval.

We also found the system has a high number of TLB misses. We tried the
'always' THP enabled policy and it greatly reduced TLB misses, but page
reclamation also became more frequent due to the memory bloat caused by THP
internal fragmentation. We will try another DAMON-based operation scheme
that applies 'MADV_HUGEPAGE' to memory regions having >=2MB size and high
access frequency, while applying 'MADV_NOHUGEPAGE' to regions having <2MB size
and low access frequency.

We do not own the systems, so we only reported the analysis results and
possible optimization solutions to the owners. The owners were satisfied with
the analysis results and promised to try the optimization guides.

[1] https://lore.kernel.org/linux-mm/20201006123931.5847-1-sjpark@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-api/20200622192900.22757-4-minchan@xxxxxxxxxx/

In summary, DAMON has been used on production systems and proved its
usefulness.


Thanks,
SeongJae Park

On Tue, 17 Nov 2020 09:05:39 +0100 SeongJae Park <sjpark@xxxxxxxxxx> wrote:

> Another week, another ping. I'm waiting for _any_ comments.
>
>
> Thanks,
> SeongJae Park
>
> On Wed, 11 Nov 2020 17:41:13 +0100 SeongJae Park <sjpark@xxxxxxxxxx> wrote:
>
> > Hi, I'd like to remind you that I'm still waiting more reviews. Any comments
> > are welcome.
> >
> >
> > Thanks,
> > SeongJae Park
> >
> > On Tue, 20 Oct 2020 10:59:22 +0200 SeongJae Park <sjpark@xxxxxxxxxx> wrote:
> >
> > > From: SeongJae Park <sjpark@xxxxxxxxx>
> > >
> > > Changes from Previous Version (v21)
> > > ===================================
> > >
> > > This version contains below minor changes.
> > >
> > > - Fix build warnings and errors (kernel test robot)
> > > - Fix a memory leak (kmemleak)
> > > - Respect KUNIT_ALL_TESTS
> > > - Rebase on v5.9
> > > - Update the evaluation results
> > >
> > > Introduction
> > > ============
> > >
> > > DAMON is a data access monitoring framework for the Linux kernel. The core
> > > mechanisms of DAMON called 'region based sampling' and 'adaptive regions
> > > adjustment' (refer to 'mechanisms.rst' in the 11th patch of this patchset for
> > > the detail) make it
> > >
> > > - accurate (The monitored information is useful for DRAM level memory
> > >   management. It might not appropriate for Cache-level accuracy, though.),
> > > - light-weight (The monitoring overhead is low enough to be applied online
> > >   while making no impact on the performance of the target workloads.), and
> > > - scalable (the upper-bound of the instrumentation overhead is controllable
> > >   regardless of the size of target workloads.).
> > >
> > > Using this framework, therefore, several memory management mechanisms such as
> > > reclamation and THP can be optimized to aware real data access patterns.
> > > Experimental access pattern aware memory management optimization works that
> > > incurring high instrumentation overhead will be able to have another try.
> > >
> > > Though DAMON is for kernel subsystems, it can be easily exposed to the user
> > > space by writing a DAMON-wrapper kernel subsystem.
> > > Then, user space users who
> > > have some special workloads will be able to write personalized tools or
> > > applications for deeper understanding and specialized optimizations of their
> > > systems.
> > >
> > > Evaluations
> > > ===========
> > >
> > > We evaluated DAMON's overhead, monitoring quality and usefulness using 24
> > > realistic workloads on my QEMU/KVM based virtual machine running a kernel that
> > > v22 DAMON patchset is applied.
> > >
> > > DAMON is lightweight. It increases system memory usage by 0.25% and slows
> > > target workloads down by 0.89%.
> > >
> > > DAMON is accurate and useful for memory management optimizations. An
> > > experimental DAMON-based operation scheme for THP, 'ethp', removes 81.73% of
> > > THP memory overheads while preserving 95.29% of THP speedup. Another
> > > experimental DAMON-based 'proactive reclamation' implementation, 'prcl',
> > > reduces 91.30% of residential sets and 23.45% of system memory footprint while
> > > incurring only 2.08% runtime overhead in the best case (parsec3/freqmine).
> > >
> > > NOTE that the experimentail THP optimization and proactive reclamation are not
> > > for production but only for proof of concepts.
> > >
> > > Please refer to the official document[1] or "Documentation/admin-guide/mm: Add
> > > a document for DAMON" patch in this patchset for detailed evaluation setup and
> > > results.
> > >
> > > [1] https://damonitor.github.io/doc/html/latest-damon/admin-guide/mm/damon/eval.html
> > >
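
As a rough illustration of the THP scheme described at the top of this mail
('MADV_HUGEPAGE' for >=2MB regions with high access frequency, 'MADV_NOHUGEPAGE'
for <2MB regions with low access frequency), the selection rule could be
modeled as below. This is only a sketch in Python and is not the DAMON
schemes interface. `Region`, `choose_thp_action()`, and the two threshold
values are hypothetical and chosen only for illustration; the mail does not
say what counts as "high" or "low" access frequency.

```python
from dataclasses import dataclass
from typing import Optional

# 2MB, the size boundary named in the mail.
HUGE_PAGE_SIZE = 2 * 1024 * 1024


@dataclass
class Region:
    """Hypothetical monitoring result for one address range."""
    start: int          # inclusive start address
    end: int            # exclusive end address
    access_rate: float  # assumed: fraction of checks that saw an access (0.0-1.0)

    @property
    def size(self) -> int:
        return self.end - self.start


def choose_thp_action(region: Region,
                      hot_threshold: float = 0.5,
                      cold_threshold: float = 0.05) -> Optional[str]:
    """Return the name of the madvise hint to apply, or None for no action.

    hot_threshold and cold_threshold are illustrative assumptions only.
    """
    if region.size >= HUGE_PAGE_SIZE and region.access_rate >= hot_threshold:
        return "MADV_HUGEPAGE"
    if region.size < HUGE_PAGE_SIZE and region.access_rate <= cold_threshold:
        return "MADV_NOHUGEPAGE"
    # Everything else is left to the system-wide THP policy.
    return None
```

Regions that match neither rule (for example, a small but hot region) get no
hint in this sketch, which mirrors the mail naming only the two cases.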