An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).

Its goal is to improve the scalability of global reclaim, which is
critical to systemwide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.

Its memory overhead is a pointer per LRU vector and is negligible
per node. In terms of traversing memcgs during global reclaim, it
improves the best-case complexity from O(n) to O(1) and does not
affect the worst-case complexity of O(n). Therefore, on average, it
has sublinear complexity, in contrast to the current linear
complexity.

The basic structure of an memcg LRU can be understood by analogy to
the active/inactive LRU (of folios):
1. It has the young and the old (generations);
2. Its linked lists have the head and the tail;
3. The increment of max_seq triggers promotion;
4. Other events, e.g., offlining an memcg, trigger similar
   operations.

In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
   the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out and
   reduces latency without affecting fairness over some time.

The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221201223923.873696-7-yuzhao@xxxxxxxxxx/

The following is a simple test to quickly verify its effectiveness.
More benchmarks are coming soon.

  Test design:
  1. Create multiple memcgs.
  2. Each memcg contains a job (fio).
  3. All jobs access the same amount of memory randomly.
  4. The system does not experience global memory pressure.
  5. Periodically write to the root memory.reclaim.

  Desired outcome:
  1. All memcgs have similar pgsteal, i.e.,
     stddev(pgsteal)/mean(pgsteal) is close to 0%.
  2. The total pgsteal is close to the total requested through
     memory.reclaim, i.e., sum(pgsteal)/sum(requested) is close to
     100%.
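
  The two ratios in the desired outcome could be computed from the
  collected pgsteal values roughly as follows. This is only a sketch
  and not part of the series: pgsteal_summary is a hypothetical
  helper, it assumes one pgsteal value (in pages) per line on stdin,
  and it assumes the caller has already converted the total requested
  through memory.reclaim into pages.

```shell
# Hypothetical helper, not part of the series.
# stdin: one pgsteal value (in pages) per line, one line per memcg.
# $1:    total requested through memory.reclaim, already converted
#        to pages by the caller (assumption).
pgsteal_summary() {
    awk -v requested="$1" '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            if (n == 0 || sum == 0) { print "no data"; exit 1 }
            mean = sum / n
            var = sumsq / n - mean * mean   # population variance
            if (var < 0) var = 0            # guard against rounding error
            printf "stddev/mean: %.0f%%\n", 100 * sqrt(var) / mean
            printf "sum/requested: %.0f%%\n", 100 * sum / requested
        }'
}
```

  For example, the grep output of the test script could be piped
  through awk '{ print $2 }' into pgsteal_summary. With 4 KiB pages
  (an assumption), 600 writes of 256m would correspond to
  600 * 65536 = 39321600 requested pages.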

  Actual outcome [1]:
               stddev(pgsteal)/mean(pgsteal)  sum(pgsteal)/sum(requested)
    MGLRU off  75%                            425%
    MGLRU on   20%                            95%

  ####################################################################
  MEMCGS=128

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      mkdir /sys/fs/cgroup/memcg$memcg
  done

  start() {
      echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

      fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
          --filename=/dev/zero --size=1920M --rw=randrw \
          --rate=64m,64m --random_distribution=random \
          --fadvise_hint=0 --time_based --runtime=10h \
          --group_reporting --minimal
  }

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      start &
  done

  sleep 600

  for ((i = 0; i < 600; i++)); do
      echo 256m >/sys/fs/cgroup/memory.reclaim
      sleep 6
  done

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
  done
  ####################################################################

[1]: This was obtained from running the above script (touches less
     than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
     hour.

Yu Zhao (8):
  mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
  mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
  mm: multi-gen LRU: remove eviction fairness safeguard
  mm: multi-gen LRU: remove aging fairness safeguard
  mm: multi-gen LRU: shuffle should_run_aging()
  mm: multi-gen LRU: per-node lru_gen_folio lists
  mm: multi-gen LRU: clarify scan_control flags
  mm: multi-gen LRU: simplify arch_has_hw_pte_young() check

 Documentation/mm/multigen_lru.rst |   8 +-
 include/linux/memcontrol.h        |  10 +
 include/linux/mm_inline.h         |  25 +-
 include/linux/mmzone.h            | 127 ++++-
 mm/memcontrol.c                   |  16 +
 mm/page_alloc.c                   |   1 +
 mm/vmscan.c                       | 765 ++++++++++++++++++++----------
 mm/workingset.c                   |   4 +-
 8 files changed, 687 insertions(+), 269 deletions(-)

-- 
2.39.0.rc0.267.gcb52ba06e7-goog