From: SeongJae Park <sjpark@xxxxxxxxx>

Introduction
============

As page-granularity access checking overhead could be too significant on huge
systems, LRU lists are normally sorted reactively and partially, in response
to special events such as explicit system calls and memory pressure.  As a
result, LRU lists may not be sorted well enough to find good reclamation
target pages, especially when memory pressure first occurs after a while.

Proactive reclamation is well known to be helpful for minimizing the
performance drops caused by memory pressure.  However, proactive reclamation
could incur additional I/O, so it is not the best option in some cases.  For
example, cloud block storage services would charge for each I/O.

Using DAMON for Proactive LRU-lists Sorting (PLRUS) could be helpful in this
situation, as DAMON can identify access patterns while inducing only
controlled overhead.  The idea is simple.  Find hot pages and cold pages using
DAMON, and call 'mark_page_accessed()' for the hot pages and
'deactivate_page()' for the cold pages.

This patchset extends DAMON to support PLRUS by introducing a new DAMOS action
that applies 'mark_page_accessed()' to memory regions of a specific access
pattern, and by supporting the 'cold' DAMOS action in the physical address
space monitoring operations set.

In terms of making reclamation less harmful, PLRUS works similarly to
proactive reclamation, but avoids the additional I/O.  Of course, unlike
proactive reclamation, PLRUS does not reduce memory utilization on its own.
If that is a problem, running DAMON-based proactive reclamation
(DAMON_RECLAIM) simultaneously, for only super cold pages or for severe
memory pressure, could work.

One additional advantage of PLRUS is that it makes LRU lists a more
trustworthy source of access patterns.

Example DAMON-based Operation Schemes for PLRUS
===============================================

So, users will be able to do PLRUS via DAMON-based Operation Schemes (DAMOS)
after applying this patchset.
An example of such a DAMOS config for PLRUS is shown below.  Sorry for the
cryptic format.  Please refer to the parser script[1] for details of the
format.  In short, this config asks DAMON to

1. find any memory regions of >=4K size that have shown at least some access
   (approximately 20 accesses per 100 samples) and apply
   'mark_page_accessed()' to them using up to 2% CPU time.  Under the CPU
   time limit, apply the function first to regions that have a higher access
   frequency and have kept that frequency for longer.
2. find any memory regions of >=4K size that have shown no access for 200ms
   or more and apply 'deactivate_page()' to them using up to 2% CPU time.
   Under the CPU time limit, apply the function first to regions that have
   stayed unaccessed for longer.

    # format is:
    # <min/max size> <min/max frequency (0-100)> <min/max age> <action> \
    #         <quota> <weights> <watermarks>

    # LRU-activate hot pages (more hot ones first) under 2% CPU usage limit
    4K  max         20 max          min max         hot \
            20ms 0B 1s      0 7 3           free_mem_rate 5s 1000 999 0

    # LRU-deactivate cold pages (colder ones first) under 2% CPU usage limit
    4K  max         min min         20ms max        cold \
            20ms 0B 1s      0 3 7           free_mem_rate 5s 1000 999 0

[1] https://github.com/awslabs/damo/blob/next/_convert_damos.py

Evaluation
==========

To show the effect of PLRUS, I ran PARSEC3 and SPLASH-2X benchmarks under the
kernel variants below and measured the runtime of each workload.

- orig: Latest mm-unstable kernel + this patchset, but no DAMON scheme
  applied.
- mprs: Same as orig, but with artificial memory pressure.
- plrus: Same as mprs, but with the above example PLRUS scheme applied to the
  physical address space of the system.

For the artificial memory pressure, I set memory.limit_in_bytes to 75% of the
running workload's peak RSS, wait 3 seconds, remove the pressure by setting
it to 200% of the running workload's peak RSS, wait 30 seconds, and repeat
the procedure until the workload finishes[1].  I use a zram-based swap
device.  The tests are automated[2].
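The pressure procedure above can be sketched roughly as follows.  This is
only an illustration of the described steps, not the runner script in [1].
The function names, the injected 'is_running' and 'write_limit' callbacks,
and the cgroup path in the usage note are all assumptions made for this
sketch.

```python
import time


def pressure_limits(peak_rss_bytes):
    """Return (tight, relaxed) limits: 75% and 200% of the peak RSS."""
    return peak_rss_bytes * 75 // 100, peak_rss_bytes * 2


def cgroup_limit_writer(limit_file):
    """Build a callback that writes a byte limit to a memcg limit file.

    limit_file is a hypothetical path to the workload cgroup's
    memory.limit_in_bytes file.
    """
    def write_limit(limit_bytes):
        with open(limit_file, "w") as f:
            f.write(str(limit_bytes))
    return write_limit


def run_pressure_cycles(peak_rss_bytes, is_running, write_limit,
                        sleep=time.sleep, tight_secs=3, relaxed_secs=30):
    """Alternate tight and relaxed limits until the workload finishes.

    Returns the number of completed tight/relaxed cycles.
    """
    tight, relaxed = pressure_limits(peak_rss_bytes)
    cycles = 0
    while is_running():
        write_limit(tight)      # apply the pressure (75% of peak RSS)
        sleep(tight_secs)
        write_limit(relaxed)    # remove the pressure (200% of peak RSS)
        sleep(relaxed_secs)
        cycles += 1
    return cycles
```

A caller would pass something like
cgroup_limit_writer("/sys/fs/cgroup/memory/<group>/memory.limit_in_bytes")
as 'write_limit', where <group> is a placeholder for the workload's memory
cgroup, and a function that polls whether the workload process is alive as
'is_running'.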
I repeat the tests five times and calculate the average runtime of the five
measurements.  The results are as below:

    runtime_secs             orig     mprs     plrus    plrus/mprs
    parsec3/blackscholes     139.35   139.68   140.37   1.00
    parsec3/bodytrack        124.67   127.26   128.31   1.01
    parsec3/canneal          207.61   400.95   355.23   0.89
    parsec3/dedup            18.30    18.84    19.30    1.02
    parsec3/facesim          350.42   353.69   349.14   0.99
    parsec3/fluidanimate     338.57   337.16   342.18   1.01
    parsec3/freqmine         434.39   435.67   436.49   1.00
    parsec3/raytrace         182.24   186.18   189.08   1.02
    parsec3/streamcluster    634.49   2993.27  2576.04  0.86
    parsec3/swaptions        221.68   221.84   221.97   1.00
    parsec3/vips             87.82    103.01   103.18   1.00
    parsec3/x264             108.92   132.82   128.22   0.97
    splash2x/barnes          130.30   135.87   138.52   1.02
    splash2x/fft             62.09    98.33    99.85    1.02
    splash2x/lu_cb           132.15   135.49   135.22   1.00
    splash2x/lu_ncb          149.89   154.92   155.26   1.00
    splash2x/ocean_cp        80.04    108.20   113.85   1.05
    splash2x/ocean_ncp       163.70   217.40   231.09   1.06
    splash2x/radiosity       142.32   143.13   144.50   1.01
    splash2x/radix           50.28    78.21    85.96    1.10
    splash2x/raytrace        133.75   134.21   136.21   1.01
    splash2x/volrend         120.39   121.72   120.87   0.99
    splash2x/water_nsquared  373.37   388.31   398.72   1.03
    splash2x/water_spatial   133.81   143.73   144.00   1.00
    total                    4520.54  7309.87  6893.55  0.94
    average                  188.36   304.58   287.23   0.94

The second, third, and fourth columns show the runtime of each workload under
the three configs in seconds, and the fifth column shows the plrus runtime
divided by the mprs runtime.

On average, 'plrus' achieves about 6% speedup under memory pressure.  For the
two best cases (parsec3/canneal and parsec3/streamcluster), 'plrus' achieves
about 11% and 14% speedup under memory pressure.

Please note that the scheme is not tuned for each workload, is applied to the
entire system memory, and uses at most 4% of a single CPU's time.
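The percentages quoted above can be re-derived from the table and the example
schemes with a few lines of arithmetic.  The sketch below is only a sanity
check.  The helper names are made up for illustration, and reading the
'20ms 0B 1s' quota as "20ms of CPU time per 1s interval" is my interpretation
of the 2% limit stated for each scheme.

```python
def cpu_share(quota_ms, interval_ms):
    """Fraction of one CPU that a scheme may use under its time quota."""
    return quota_ms / interval_ms


def speedup_percent(mprs_secs, plrus_secs):
    """Runtime reduction of plrus relative to mprs, in percent."""
    return (1 - plrus_secs / mprs_secs) * 100


if __name__ == "__main__":
    # Two schemes (hot and cold), each limited to 20ms per 1000ms.
    per_scheme = cpu_share(20, 1000)
    print("per-scheme CPU share:", per_scheme)
    print("both schemes:", 2 * per_scheme)

    # Runtimes in seconds (mprs, plrus) taken from the table above.
    rows = {
        "parsec3/canneal": (400.95, 355.23),
        "parsec3/streamcluster": (2993.27, 2576.04),
        "total": (7309.87, 6893.55),
    }
    for name, (mprs, plrus) in rows.items():
        print(name, round(speedup_percent(mprs, plrus), 1))
```

The two schemes together come to 4% of a single CPU, and the three rows come
out at roughly 11%, 14%, and 6%, in line with the numbers stated above.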
[1] https://github.com/awslabs/damon-tests/blob/next/perf/runners/back/0009_memcg_pressure.sh
[2] https://github.com/awslabs/damon-tests/tree/next/perf

Sequence of Patches
===================

The first patch cleans up the DAMOS_PAGEOUT handling code of the physical
address space monitoring operations implementation for easier extension of
the code.  The second patch implements a new DAMOS action called 'hot', which
applies 'mark_page_accessed()' to the pages under the memory regions having
the target access pattern.  Finally, the third patch makes the physical
address space monitoring operations implementation support the 'cold' action,
which applies 'deactivate_page()' to the pages under the memory regions
having the target access pattern.

SeongJae Park (3):
  mm/damon/paddr: move DAMOS_PAGEOUT handling to a separate function
  mm/damon/schemes: Support 'hot' action
  mm/damon/paddr: Support DAMOS_COLD

 include/linux/damon.h |  2 ++
 mm/damon/ops-common.c | 42 ++++++++++++++++++++++++++++++
 mm/damon/ops-common.h |  2 ++
 mm/damon/paddr.c      | 60 ++++++++++++++++++++++++++++++++++++++-----
 mm/damon/sysfs.c      |  1 +
 5 files changed, 101 insertions(+), 6 deletions(-)

-- 
2.17.1