From: Kairui Song <kasong@xxxxxxxxxxx>

Link V1: https://lore.kernel.org/linux-mm/20231222102255.56993-1-ryncsn@xxxxxxxxx/
Link V2: https://lore.kernel.org/linux-mm/20240111183321.19984-1-ryncsn@xxxxxxxxx/

Currently, when MGLRU ages, it moves pages one by one and updates the mm
counters page by page. This is correct, but the overhead can be reduced
by batching these operations.
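To illustrate the idea, here is a minimal userspace-style sketch of
per-batch accounting. This is not the actual vmscan.c change; the
counter, batch size, and loop below are made up for illustration only:

	/*
	 * Toy sketch: instead of adjusting a shared counter once per page
	 * while walking the LRU, accumulate the change locally and apply
	 * it once per batch.
	 */
	#include <stdio.h>

	#define NR_PAGES 1024
	#define BATCH    64

	static long nr_pages_in_gen;	/* stands in for a per-gen counter */

	int main(void)
	{
		long batched = 0;
		int i;

		for (i = 0; i < NR_PAGES; i++) {
			/* ... "move" the page to the target generation ... */
			batched++;

			/* flush the accumulated delta once per batch */
			if (batched == BATCH) {
				nr_pages_in_gen += batched;
				batched = 0;
			}
		}
		/* flush whatever is left over */
		nr_pages_in_gen += batched;

		printf("pages accounted: %ld\n", nr_pages_in_gen);
		return 0;
	}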
I rebased the series and ran more tests to check for regressions or
improvements. Everything looks OK except the memtier test, where I
lowered the repeat count (-x) compared to V1 and V2 and simply ran the
test more times instead. It now shows a minor regression; if it is real,
it is caused by the prefetch patch. However, the noise (standard
deviation) is rather high, so I'm not sure that result is credible.

The test results for each individual patch are in the commit messages.

Test 1: Ramdisk fio ro test in a 4G memcg on an EPYC 7K62:
  fio -name=mglru --numjobs=16 --directory=/mnt --size=960m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:0.5 --norandommap \
    --time_based --ramp_time=1m --runtime=6m --group_reporting

Before this series:
  bw ( MiB/s): min= 7758, max= 9239, per=100.00%, avg=8747.59, stdev=16.51, samples=11488
  iops       : min=1986251, max=2365323, avg=2239380.87, stdev=4225.93, samples=11488

After this series (+7.1%):
  bw ( MiB/s): min= 8359, max= 9796, per=100.00%, avg=9367.29, stdev=15.75, samples=11488
  iops       : min=2140113, max=2507928, avg=2398024.65, stdev=4033.07, samples=11488

Test 2: Ramdisk fio hybrid test for 30m in a 4G memcg on an EPYC 7K62 (3 times):
  fio --buffered=1 --numjobs=8 --size=960m --directory=/mnt \
    --time_based --ramp_time=1m --runtime=30m \
    --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
    --iodepth_batch_complete=32 --norandommap \
    --name=mglru-ro --rw=randread --random_distribution=zipf:0.7 \
    --name=mglru-rw --rw=randrw --random_distribution=zipf:0.7

Before this series:
  READ:  6622.0 MiB/s, Stdev: 22.090722
  WRITE: 1256.3 MiB/s, Stdev: 5.249339

After this series (+5.4%, +3.9%):
  READ:  6981.0 MiB/s, Stdev: 15.556349
  WRITE: 1305.7 MiB/s, Stdev: 2.357023

Test 3: 30m MySQL test in a 6G memcg with swap (12 times):
  echo 'set GLOBAL innodb_buffer_pool_size=16106127360;' | \
    mysql -u USER -h localhost --password=PASS

  sysbench /usr/share/sysbench/oltp_read_only.lua \
    --mysql-user=USER --mysql-password=PASS --mysql-db=DB \
    --tables=48 --table-size=2000000 --threads=16 --time=1800 run

Before this series:
  Avg: 134743.714545 qps. Stdev: 582.242189

After this series (+0.3%):
  Avg: 135099.210000 qps. Stdev: 351.488863

Test 4: Build the Linux kernel in a 2G memcg with make -j48, with swap
(for memory stress, 18 times):

Before this series:
  Avg: 1456.768899 s. Stdev: 20.106973

After this series (-0.5%):
  Avg: 1464.178154 s. Stdev: 17.992974

Test 5: Memtier test in a 4G cgroup using brd as swap (18 times):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0766 -t 16 -B binary &
  memtier_benchmark -S /tmp/memcached.socket \
    -P memcache_binary -n allkeys \
    --key-minimum=1 --key-maximum=16000000 -d 1024 \
    --ratio=1:0 --key-pattern=P:P -c 1 -t 16 --pipeline 8 -x 3

Before this series:
  Avg: 50317.984000 Ops/sec. Stdev: 2568.965458

After this series (-2.7%):
  Avg: 48959.374118 Ops/sec. Stdev: 3488.559744

Updates from V2:
- Add more tests, and simplify patch 2/3 to carry only one generation's
  info per batch, as Wei Xu suggested the batch struct may use too much
  stack.
- Add more tests, and test each individual patch, as requested by Wei Xu.
- Fix a typo pointed out by Andrew Morton.

Update from V1:
- Fix a function argument type, as suggested by Chris Li.

Kairui Song (3):
  mm, lru_gen: try to prefetch next page when scanning LRU
  mm, lru_gen: batch update counters on aging
  mm, lru_gen: move pages in bulk when aging

 mm/vmscan.c | 145 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 125 insertions(+), 20 deletions(-)

-- 
2.43.0