On Tue, Jan 23, 2024 at 10:46 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> From: Kairui Song <kasong@xxxxxxxxxxx>
>
> Prefetch for the inactive/active LRU has long existed; apply the same
> optimization to MGLRU.
>
> Test 1: Ramdisk fio ro test in a 4G memcg on an EPYC 7K62:
>   fio -name=mglru --numjobs=16 --directory=/mnt --size=960m \
>     --buffered=1 --ioengine=io_uring --iodepth=128 \
>     --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
>     --rw=randread --random_distribution=zipf:0.5 --norandommap \
>     --time_based --ramp_time=1m --runtime=6m --group_reporting
>
> Before this patch:
>   bw ( MiB/s): min= 7758, max= 9239, per=100.00%, avg=8747.59, stdev=16.51, samples=11488
>   iops       : min=1986251, max=2365323, avg=2239380.87, stdev=4225.93, samples=11488
>
> After this patch (+7.2%):
>   bw ( MiB/s): min= 8360, max= 9771, per=100.00%, avg=9381.31, stdev=15.67, samples=11488
>   iops       : min=2140296, max=2501385, avg=2401613.91, stdev=4010.41, samples=11488
>
> Test 2: Ramdisk fio hybrid test for 30m in a 4G memcg on an EPYC 7K62 (3 times):
>   fio --buffered=1 --numjobs=8 --size=960m --directory=/mnt \
>     --time_based --ramp_time=1m --runtime=30m \
>     --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
>     --iodepth_batch_complete=32 --norandommap \
>     --name=mglru-ro --rw=randread --random_distribution=zipf:0.7 \
>     --name=mglru-rw --rw=randrw --random_distribution=zipf:0.7
>
> Before this patch:
>   READ: 6622.0 MiB/s. Stdev: 22.090722
>   WRITE: 1256.3 MiB/s. Stdev: 5.249339
>
> After this patch (+4.6%, +3.3%):
>   READ: 6926.6 MiB/s, Stdev: 37.950260
>   WRITE: 1297.3 MiB/s, Stdev: 7.408704
>
> Test 3: 30m of MySQL test in 6G memcg (12 times):
>   echo 'set GLOBAL innodb_buffer_pool_size=16106127360;' | \
>     mysql -u USER -h localhost --password=PASS
>
>   sysbench /usr/share/sysbench/oltp_read_only.lua \
>     --mysql-user=USER --mysql-password=PASS --mysql-db=DB \
>     --tables=48 --table-size=2000000 --threads=16 --time=1800 run
>
> Before this patch:
>   Avg: 134743.714545 qps. Stdev: 582.242189
>
> After this patch (+0.2%):
>   Avg: 135005.779091 qps. Stdev: 295.299027
>
> Test 4: Build linux kernel in 2G memcg with make -j48 with SSD swap
> (for memory stress, 18 times):
>
> Before this patch:
>   Avg: 1456.768899 s. Stdev: 20.106973
>
> After this patch (+0.0%):
>   Avg: 1455.659254 s. Stdev: 15.274481
>
> Test 5: Memtier test in a 4G cgroup using brd as swap (18 times):
>   memcached -u nobody -m 16384 -s /tmp/memcached.socket \
>     -a 0766 -t 16 -B binary &
>   memtier_benchmark -S /tmp/memcached.socket \
>     -P memcache_binary -n allkeys \
>     --key-minimum=1 --key-maximum=16000000 -d 1024 \
>     --ratio=1:0 --key-pattern=P:P -c 1 -t 16 --pipeline 8 -x 3
>
> Before this patch:
>   Avg: 50317.984000 Ops/sec. Stdev: 2568.965458
>
> After this patch (-5.7%):
>   Avg: 47691.343500 Ops/sec. Stdev: 3925.772473
>
> It seems prefetch is helpful in most cases, but the memtier test is
> either hitting a case where prefetch causes a higher cache miss rate or
> it's just too noisy (high stdev).
>
> Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> ---
>  mm/vmscan.c | 30 ++++++++++++++++++++++++++----
>  1 file changed, 26 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f9c854ce6cc..03631cedb3ab 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3681,15 +3681,26 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
>  	/* prevent cold/hot inversion if force_scan is true */
>  	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
>  		struct list_head *head = &lrugen->folios[old_gen][type][zone];
> +		struct folio *prev = NULL;
>
> -		while (!list_empty(head)) {
> -			struct folio *folio = lru_to_folio(head);
> +		if (!list_empty(head))
> +			prev = lru_to_folio(head);
> +
> +		while (prev) {
> +			struct folio *folio = prev;
>
>  			VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
>  			VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
>  			VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
>  			VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
>
> +			if (unlikely(list_is_first(&folio->lru, head))) {
> +				prev = NULL;
> +			} else {
> +				prev = lru_to_folio(&folio->lru);
> +				prefetchw(&prev->flags);
> +			}

This makes the code flow much harder to follow. Also, for architectures that
do not support prefetch, this will be a net loss.

Can you use prefetchw_prev_lru_folio() instead? It will make the code much
easier to follow, and it turns into a no-op when prefetch is not supported.

Chris

> +
>  			new_gen = folio_inc_gen(lruvec, folio, false);
>  			list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
>
> @@ -4341,11 +4352,15 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>  	for (i = MAX_NR_ZONES; i > 0; i--) {
>  		LIST_HEAD(moved);
>  		int skipped_zone = 0;
> +		struct folio *prev = NULL;
>  		int zone = (sc->reclaim_idx + i) % MAX_NR_ZONES;
>  		struct list_head *head = &lrugen->folios[gen][type][zone];
>
> -		while (!list_empty(head)) {
> -			struct folio *folio = lru_to_folio(head);
> +		if (!list_empty(head))
> +			prev = lru_to_folio(head);
> +
> +		while (prev) {
> +			struct folio *folio = prev;
>  			int delta = folio_nr_pages(folio);
>
>  			VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
> @@ -4355,6 +4370,13 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>
>  			scanned += delta;
>
> +			if (unlikely(list_is_first(&folio->lru, head))) {
> +				prev = NULL;
> +			} else {
> +				prev = lru_to_folio(&folio->lru);
> +				prefetchw(&prev->flags);
> +			}
> +
>  			if (sort_folio(lruvec, folio, sc, tier))
>  				sorted += delta;
>  			else if (isolate_folio(lruvec, folio, sc)) {
> --
> 2.43.0
>
>
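
For reference, a rough and untested sketch (just to illustrate, not a tested
patch) of how the first hunk could look if it used the
prefetchw_prev_lru_folio() macro that already exists near the top of
mm/vmscan.c; the scan_folios() hunk would follow the same pattern, and the
explicit prev/list_is_first() bookkeeping goes away:

	/* prevent cold/hot inversion if force_scan is true */
	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
		struct list_head *head = &lrugen->folios[old_gen][type][zone];

		while (!list_empty(head)) {
			struct folio *folio = lru_to_folio(head);

			VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
			VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
			VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
			VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);

			/*
			 * Prefetch the folio that will be looked at next
			 * (folio->lru.prev, since the walk starts from the
			 * tail). Skipped when this is the first entry on the
			 * list, and a no-op when !ARCH_HAS_PREFETCHW.
			 */
			prefetchw_prev_lru_folio(folio, head, flags);

			new_gen = folio_inc_gen(lruvec, folio, false);
			list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);

			/* rest of the loop body unchanged */
		}
	}

The macro compares folio->lru.prev against the list head internally, so the
loop keeps its original !list_empty() shape, and architectures without
prefetchw pay nothing because the macro is defined away there.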