Hi SeongJae,

On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park <sj@xxxxxxxxxx> wrote:

> On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim <honggyu.kim@xxxxxx> wrote:
>
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> >
> > It says that no implementation of the demote/promote DAMOS actions has
> > been made yet.  This RFC is about their implementation for the physical
> > address space.
> >
> >
> > Introduction
> > ============
> >
> > With the advent of CXL/PCIe attached DRAM, which will simply be called
> > CXL memory in this cover letter, some systems are becoming more
> > heterogeneous, having memory systems with different latency and
> > bandwidth characteristics.  They are usually handled as different NUMA
> > nodes in separate memory tiers, and CXL memory is used as a slow tier
> > because of its protocol overhead compared to local DRAM.
> >
> > In this kind of system, we need to be careful to place memory pages on
> > the proper NUMA nodes based on memory access frequency.  Otherwise,
> > some frequently accessed pages might reside on slow tiers, which causes
> > unexpected performance degradation.  Moreover, memory access patterns
> > can change at runtime.
> >
> > To handle this problem, we need a way to monitor memory access patterns
> > and migrate pages based on their access temperature.  The DAMON (Data
> > Access MONitor) framework and its DAMOS (DAMON-based Operation Schemes)
> > can be useful features for monitoring and migrating pages.  DAMOS
> > provides multiple actions based on DAMON monitoring results and can be
> > used for proactive reclaim, i.e. swapping cold pages out with the
> > DAMOS_PAGEOUT action, but it doesn't support migration actions such as
> > demotion and promotion between tiered memory nodes.
> >
> > This series supports two new DAMOS actions: DAMOS_DEMOTE for demotion
> > from fast tiers and DAMOS_PROMOTE for promotion from slow tiers.  This
> > prevents hot pages from being stuck on slow tiers, which would degrade
> > performance, and cold pages can be proactively demoted to slow tiers so
> > that the system has a better chance of allocating more hot pages on
> > fast tiers.
> >
> > DAMON provides various tuning knobs, but we found that proactive
> > demotion of cold pages is especially useful when the system is running
> > out of memory on its fast tier nodes.
> >
> > Our evaluation results show that it reduces the performance slowdown
> > compared to the default memory policy from 15~17% to 4~5% when the
> > system runs under high memory pressure on its fast tier DRAM nodes.
> >
> >
> > DAMON configuration
> > ===================
> >
> > The specific DAMON configuration doesn't have to be in the scope of
> > this patch series, but a rough idea is worth sharing to explain the
> > evaluation results.
> >
> > DAMON provides many knobs for fine tuning, but its configuration file
> > is generated by HMSDK[2].  It includes a gen_config.py script that
> > generates a json file with the full config of DAMON knobs, and it
> > creates multiple kdamonds, one for each NUMA node, when DAMON is
> > enabled so that it can run hot/cold based migration for tiered memory.
>
> I was feeling a bit confused from here since DAMON doesn't receive
> parameters via a file.  To my understanding, the 'configuration file'
> means the input file for the DAMON user-space tool, damo, not DAMON
> itself.  Just a trivial thing, but making it clear if possible could
> help readers in my opinion.
> >
> >
> > Evaluation Workload
> > ===================
> >
> > The performance evaluation is done with redis[3], which is a widely
> > used in-memory database, and the memory access patterns are generated
> > via YCSB[4].  We have measured two different workloads with zipfian
> > and latest distributions, but their configs are slightly modified to
> > make memory usage higher and execution time longer for better
> > evaluation.
> >
> > This evaluation of the demote and promote actions covers system-wide
> > memory management rather than partitioning hot/cold pages of a single
> > workload.  The default memory allocation policy places pages on the
> > fast tier DRAM node first, then allocates newly created pages on the
> > slow tier CXL node when the DRAM node has insufficient free space.
> > Once the page allocation is done, those pages never move between NUMA
> > nodes.  This is not true when NUMA balancing is used, but that is
> > outside the scope of this DAMON based 2-tier memory management
> > support.
> >
> > If the working set of redis fits fully into the DRAM node, then redis
> > accesses the fast DRAM only.  Since DRAM-only is faster than
> > partially accessing CXL memory in slow tiers, this environment is not
> > useful for evaluating this patch series.
> >
> > To make the pages of redis be distributed across the fast DRAM node
> > and the slow CXL node so that we can evaluate our demote and promote
> > actions, we pre-allocate some cold memory externally using mmap and
> > memset before launching redis-server.  We assumed that there is a
> > sufficient amount of cold memory in datacenters, as the TMO[5] and
> > TPP[6] papers mentioned.
> >
> > The evaluation sequence is as follows.
> >
> > 1. Turn on DAMON with the DAMOS_DEMOTE action for the DRAM node and
> >    the DAMOS_PROMOTE action for the CXL node.  It demotes cold pages
> >    on the DRAM node and promotes hot pages on the CXL node at regular
> >    intervals.
> > 2. Allocate a huge block of cold memory by calling mmap and memset on
> >    the fast tier DRAM node, then make the process sleep so that the
> >    fast tier has insufficient memory for redis-server.
> > 3. Launch redis-server and load the prebaked snapshot image,
> >    dump.rdb.  The redis-server consumes 52GB of anon pages and 33GB
> >    of file pages, but due to the cold memory allocated in step 2, it
> >    fails to allocate the entire memory of redis-server on the fast
> >    tier DRAM node, so it partially allocates the remainder on the
> >    slow tier CXL node.  The DRAM:CXL ratio depends on the size of the
> >    pre-allocated cold memory.
> > 4. Run YCSB to generate a zipfian or latest distribution of memory
> >    accesses to redis-server, then measure its execution time when it
> >    completes.
> > 5. Repeat step 4 over 50 times to measure the average execution time
> >    for each run.
> > 6. Increase the cold memory size, then go back to step 2.
> >
> > Each test at step 4 took about a minute, so repeating it 50 times
> > took almost an hour for each specific cold memory size, which ranged
> > from 440GB to 500GB in 10GB increments.  So it took more than 10
> > hours in total to get the entire evaluation results for both the
> > zipfian and latest workloads.  Repeating the same test set multiple
> > times doesn't show much difference, so I think it might be enough to
> > make the results reliable.
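(A note for anyone who wants to reproduce the setup: the cold memory
pre-allocation in step 2 above is essentially just the small program
below.  This is a minimal sketch rather than the exact tool we used,
and binding the allocation to the fast tier DRAM node, e.g. with
numactl or mbind, is not shown here.)

    /*
     * Reserve a big anonymous mapping, touch every page so the memory
     * is really allocated, then sleep so the pages stay resident but
     * are never accessed again, i.e. they stay cold.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            /* cold memory size in GB, e.g. "./coldmem 440" */
            size_t sz = (argc > 1 ? atoll(argv[1]) : 1) << 30;
            void *p;

            p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* fault every page in so the allocation is materialized */
            memset(p, 0, sz);

            /* keep the pages resident without touching them again */
            pause();
            return 0;
    }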
> >
> >
> > Evaluation Results
> > ==================
> >
> > All the result values are normalized to the DRAM-only execution
> > time, because the workload cannot be faster than DRAM-only unless it
> > hits the bandwidth peak, and our redis test doesn't go beyond the
> > bandwidth limit.
> >
> > So the DRAM-only execution time is the ideal result, unaffected by
> > the performance gap between DRAM and CXL.  The NUMA node environment
> > is as follows.
> >
> > node0 - local DRAM, 512GB with a CPU socket (fast tier)
> > node1 - disabled
> > node2 - CXL DRAM, 96GB, no CPU attached (slow tier)
> >
> > The following is the result of generating a zipfian distribution to
> > redis-server, and the numbers are averaged over 50 executions.
> >
> > 1. YCSB zipfian distribution read only workload
> > memory pressure with cold memory on node0 with 512GB of local DRAM.
> > =============+================================================+=========
> >              |       cold memory occupied by mmap and memset  |
> >              |    0G  440G  450G  460G  470G  480G  490G  500G|
> > =============+================================================+=========
> > Execution time normalized to DRAM-only values                 | GEOMEAN
> > -------------+------------------------------------------------+---------
> > DRAM-only    |  1.00     -     -     -     -     -     -     -| 1.00
> > CXL-only     |  1.21     -     -     -     -     -     -     -| 1.21
> > default      |     -  1.09  1.10  1.13  1.15  1.18  1.21  1.21| 1.15
> > DAMON 2-tier |     -  1.02  1.04  1.05  1.04  1.05  1.05  1.06| 1.04
> > =============+================================================+=========
> > CXL usage of redis-server in GB                                | AVERAGE
> > -------------+------------------------------------------------+---------
> > DRAM-only    |   0.0     -     -     -     -     -     -     -|  0.0
> > CXL-only     |  52.6     -     -     -     -     -     -     -| 52.6
> > default      |     -  19.4  26.1  32.3  38.5  44.7  50.5  50.3| 37.4
> > DAMON 2-tier |     -   0.1   1.6   5.2   8.0   9.1  11.8  13.6|  7.1
> > =============+================================================+=========
> >
> > Each test result is based on the execution environment as follows.
> >
> > DRAM-only   : redis-server uses only local DRAM memory.
> > CXL-only    : redis-server uses only CXL memory.
> > default     : default memory policy (MPOL_DEFAULT).
> >               numa balancing disabled.
> > DAMON 2-tier: DAMON enabled with DAMOS_DEMOTE for DRAM nodes and
> >               DAMOS_PROMOTE for CXL nodes.
> >
> > The above result shows that the "default" execution time goes up as
> > the size of cold memory is increased from 440G to 500G, because the
> > more cold memory is used, the more CXL memory is used for the target
> > redis workload, and this increases the execution time.
> >
> > However, the "DAMON 2-tier" result shows less slowdown because the
> > DAMOS_DEMOTE action on the DRAM node proactively demotes the
> > pre-allocated cold memory to the CXL node, and the freed DRAM space
> > increases the chance of allocating hot or warm pages of redis-server
> > on the fast DRAM node.  Moreover, the DAMOS_PROMOTE action on the CXL
> > node also actively promotes hot pages of redis-server to the DRAM
> > node.
> >
> > As a result, more memory of redis-server stays on the DRAM node
> > compared to the "default" memory policy, and this brings the
> > performance improvement.
> >
> > The following result of the latest distribution workload shows
> > similar data.
> >
> > 2. YCSB latest distribution read only workload
> > memory pressure with cold memory on node0 with 512GB of local DRAM.
> > =============+================================================+=========
> >              |       cold memory occupied by mmap and memset  |
> >              |    0G  440G  450G  460G  470G  480G  490G  500G|
> > =============+================================================+=========
> > Execution time normalized to DRAM-only values                 | GEOMEAN
> > -------------+------------------------------------------------+---------
> > DRAM-only    |  1.00     -     -     -     -     -     -     -| 1.00
> > CXL-only     |  1.18     -     -     -     -     -     -     -| 1.18
> > default      |     -  1.16  1.15  1.17  1.18  1.16  1.18  1.15| 1.17
> > DAMON 2-tier |     -  1.04  1.04  1.05  1.05  1.06  1.05  1.06| 1.05
> > =============+================================================+=========
> > CXL usage of redis-server in GB                                | AVERAGE
> > -------------+------------------------------------------------+---------
> > DRAM-only    |   0.0     -     -     -     -     -     -     -|  0.0
> > CXL-only     |  52.6     -     -     -     -     -     -     -| 52.6
> > default      |     -  19.3  26.1  32.2  38.5  44.6  50.5  50.6| 37.4
> > DAMON 2-tier |     -   1.3   3.8   7.0   4.1   9.4  12.5  16.7|  7.8
> > =============+================================================+=========
> >
> > In summary of both results, our evaluation shows that "DAMON 2-tier"
> > memory management reduces the performance slowdown compared to the
> > "default" memory policy from 15~17% to 4~5% when the system runs
> > under high memory pressure on its fast tier DRAM nodes.
> >
> > A similar evaluation was done on another machine that has 256GB of
> > local DRAM and 96GB of CXL memory.  The performance slowdown is
> > reduced from 20~24% for "default" to 5~7% for "DAMON 2-tier".
> >
> > Having these DAMOS_DEMOTE and DAMOS_PROMOTE actions can make 2-tier
> > memory systems run more efficiently under high memory pressure.
>
> Thank you for running the tests again with the new version of the
> patches and sharing the results!

This is a bit of a late answer, but the result above is from the
previous evaluation.  I ran it again with RFC v2, but didn't see much
difference, so I just pasted the same result here.

> > Signed-off-by: Honggyu Kim <honggyu.kim@xxxxxx>
> > Signed-off-by: Hyeongtak Ji <hyeongtak.ji@xxxxxx>
> > Signed-off-by: Rakie Kim <rakie.kim@xxxxxx>
> >
> > [1] https://lore.kernel.org/damon/20231112195602.61525-1-sj@xxxxxxxxxx
> > [2] https://github.com/skhynix/hmsdk
> > [3] https://github.com/redis/redis/tree/7.0.0
> > [4] https://github.com/brianfrankcooper/YCSB/tree/0.17.0
> > [5] https://dl.acm.org/doi/10.1145/3503222.3507731
> > [6] https://dl.acm.org/doi/10.1145/3582016.3582063
> >
> > Changes from RFC:
> >   1. Move most of the implementation from mm/vmscan.c to
> >      mm/damon/paddr.c.
> >   2. Simplify some functions of vmscan.c and use them in paddr.c,
> >      but they need to be reviewed more in depth.
> >   3. Refactor most functions for common usage for both promote and
> >      demote actions, and introduce an enum migration_mode for its
> >      control.
> >   4. Add a "target_nid" sysfs knob for the migration destination
> >      node for both promote and demote actions.
> >   5. Move DAMOS_PROMOTE before DAMOS_DEMOTE, and move them even
> >      above DAMOS_STAT.
>
> Thank you very much for addressing many of my comments.

Thanks for your detailed feedback.
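As a side note on item 5 above, the resulting ordering in enum
damos_action would look roughly like below.  This is a sketch against
mainline include/linux/damon.h, not the exact hunk from the series:

    /* sketch: enum damos_action with this series applied */
    enum damos_action {
            DAMOS_WILLNEED,
            DAMOS_COLD,
            DAMOS_PAGEOUT,
            DAMOS_HUGEPAGE,
            DAMOS_NOHUGEPAGE,
            DAMOS_LRU_PRIO,
            DAMOS_LRU_DEPRIO,
            DAMOS_PROMOTE,  /* new: migrate hot pages to a fast tier node */
            DAMOS_DEMOTE,   /* new: migrate cold pages to a slow tier node */
            DAMOS_STAT,     /* does nothing but count the stat */
            NR_DAMOS_ACTIONS,
    };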
> >
> > Honggyu Kim (3):
> >   mm/damon: refactor DAMOS_PAGEOUT with migration_mode
> >   mm: make alloc_demote_folio externally invokable for migration
> >   mm/damon: introduce DAMOS_DEMOTE action for demotion
> >
> > Hyeongtak Ji (4):
> >   mm/memory-tiers: add next_promotion_node to find promotion target
> >   mm/damon: introduce DAMOS_PROMOTE action for promotion
> >   mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
> >   mm/damon/sysfs-schemes: apply target_nid for promote and demote
> >     actions
>
> Honggyu joined the DAMON Beer/Coffee/Tea Chat[1] yesterday, and we
> discussed this patchset at a high level.  Sharing the summary here for
> open discussion.  As also discussed on the first version of this
> patchset[2], we want to make a single action for general page
> migration with minimum changes, but would like to keep the page level
> access re-check.  We also agreed the previously proposed DAMOS
> filter-based approach could make sense for the purpose.

Thanks very much for the summary.

I have been trying to merge the promote and demote actions into a
single migrate action, but I found an issue regarding
damon_pa_scheme_score.  It currently calls damon_cold_score() for the
demote action and damon_hot_score() for the promote action, but what
should we call when we use a single migrate action?

Thanks,
Honggyu

> Because I was anyway planning to make such a DAMOS filter for not
> only promotion/demotion but also other types of DAMOS actions, I will
> start developing the page level access re-check results based DAMOS
> filter.  Once the implementation of the prototype is done, I will
> share the early implementation.  Then, Honggyu will adjust their
> implementation based on the filter, and run their tests again and
> share the results.
>
> [1] https://lore.kernel.org/damon/20220810225102.124459-1-sj@xxxxxxxxxx/
> [2] https://lore.kernel.org/damon/20240118171756.80356-1-sj@xxxxxxxxxx
>
>
> Thanks,
> SJ
>
> >
> >  include/linux/damon.h          |  15 +-
> >  include/linux/memory-tiers.h   |  11 ++
> >  include/linux/migrate_mode.h   |   1 +
> >  include/linux/vm_event_item.h  |   1 +
> >  include/trace/events/migrate.h |   3 +-
> >  mm/damon/core.c                |   5 +-
> >  mm/damon/dbgfs.c               |   2 +-
> >  mm/damon/lru_sort.c            |   3 +-
> >  mm/damon/paddr.c               | 282 ++++++++++++++++++++++++++++++++-
> >  mm/damon/reclaim.c             |   3 +-
> >  mm/damon/sysfs-schemes.c       |  39 ++++-
> >  mm/internal.h                  |   1 +
> >  mm/memory-tiers.c              |  43 +++++
> >  mm/vmscan.c                    |  10 +-
> >  mm/vmstat.c                    |   1 +
> >  15 files changed, 404 insertions(+), 16 deletions(-)
> >
> >
> > base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
> > --
> > 2.34.1
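P.S. To make the damon_pa_scheme_score question above more concrete,
the prioritization logic with this series applied is shaped roughly
like below.  This is a simplified sketch from memory, not the exact
code from the series; the DAMOS_PAGEOUT and DAMOS_LRU_* cases are from
current mainline mm/damon/paddr.c.

    static int damon_pa_scheme_score(struct damon_ctx *context,
                    struct damon_target *t, struct damon_region *r,
                    struct damos *scheme)
    {
            switch (scheme->action) {
            case DAMOS_PAGEOUT:
            case DAMOS_LRU_DEPRIO:
            case DAMOS_DEMOTE:
                    /* colder regions get higher priority under quota */
                    return damon_cold_score(context, r, scheme);
            case DAMOS_LRU_PRIO:
            case DAMOS_PROMOTE:
                    /* hotter regions get higher priority under quota */
                    return damon_hot_score(context, r, scheme);
            default:
                    break;
            }

            return DAMOS_MAX_SCORE;
    }

One conceivable direction would be to choose between the two scores
based on whether the scheme's target_nid sits in a lower or a higher
memory tier than the monitored node, but that is just a thought for
discussion.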