On Fri, Jan 10, 2025 at 11:26 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
> On 10/01/2025 10:09, Barry Song wrote:
> > Hi Usama,
> >
> > Please include me in the discussion. I'll try to attend, at least remotely.
> >
> > On Fri, Jan 10, 2025 at 9:06 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >> I would like to propose a session to discuss the ongoing work on
> >> large folio swapin, whether it is traditional swap, zswap, or zram.
> >>
> >> Large folios have clear advantages that have been discussed before,
> >> such as fewer page faults, batched PTE and rmap manipulation, shorter
> >> LRU lists, and TLB coalescing (on arm64 and AMD). However, swapping
> >> in large folios has its own drawbacks, such as increased swap
> >> thrashing.
> >> I initially sent an RFC for zswapin of large folios in [1], but it
> >> causes a regression in kernel build time due to swap thrashing, which
> >> I am confident also happens with zram large folio swapin (which is
> >> merged in the kernel).
> >>
> >> Some of the points we could discuss in the session:
> >>
> >> - What is the right (preferably open source) benchmark for testing
> >> swapin of large folios? Kernel build time in a memory-limited cgroup
> >> shows a regression, while microbenchmarks show a massive improvement;
> >> maybe there are benchmarks where TLB misses are a big factor and
> >> would show an improvement.
> >
> > My understanding is that it largely depends on the workload. In interactive
> > scenarios, such as on a phone, swap thrashing is not an issue because
> > there is minimal to no thrashing for the app occupying the screen
> > (foreground). In such cases, swap bandwidth becomes the most critical factor
> > in improving app switching speed, especially when multiple applications
> > are switching between background and foreground states.
> >
> >>
> >> - We could have something like
> >> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> >> to enable/disable swapin, but it is going to be difficult to tune: the
> >> optimum values will likely differ between workloads, and the knobs are
> >> likely to be left at their default values. Is there some dynamic way to
> >> decide when to swap in large folios and when to fall back to smaller
> >> folios? The swapin_readahead swapcache path, which only supports 4K
> >> folios at the moment, has a readahead window based on hits. However,
> >> readahead is a folio flag and not a page flag, so this method can't be
> >> used here: once a large folio is swapped in, we won't get a fault, and
> >> subsequent hits on other pages of the large folio won't be recorded.
> >>
> >> - For zswap and zram, it might be that doing larger block compression/
> >> decompression offsets the regression from swap thrashing, but that
> >> brings its own issues. For example, once a large folio is swapped
> >> out, it could fail to swap in as a large folio and fall back
> >> to 4K, resulting in redundant decompressions.
> >
> > That's correct. My current workaround involves swapping in four small
> > folios, and zsmalloc will compress and decompress in chunks of four
> > pages, regardless of the actual size of the mTHP. The improvement in
> > compression ratio and speed becomes less significant beyond four pages,
> > even though there is still some increase.
> >
> > Our recent experiments on phones also show that enabling direct
> > reclamation for do_swap_page() to allocate order-2 mTHP results in a 0%
> > allocation failure rate, which probably removes the need to fall back
> > to four small folios. (Note that our experiments include Yu's TAO;
> > Android GKI has already merged it. However, since 2 is less than
> > PAGE_ALLOC_COSTLY_ORDER, we might achieve similar results even
> > without Yu's TAO, although I have not confirmed this.)
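
To add to my note above: if anyone wants to sanity-check the allocation
failure rate on their own setup, a minimal sketch is below. It assumes a
kernel that exposes the per-size mTHP counters under
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/; note that
anon_fault_alloc and anon_fault_fallback count all anonymous faults, not
just the do_swap_page() path, so this is only a rough proxy:

#!/bin/bash
# Approximate the order-2 (16kB with 4K base pages) mTHP allocation
# failure rate from the per-size counters. Counts all anon faults,
# not just swap-in.
stats=/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats
alloc=$(cat $stats/anon_fault_alloc)
fallback=$(cat $stats/anon_fault_fallback)
total=$((alloc + fallback))
if [ "$total" -gt 0 ]; then
        echo "fallback rate: $((100 * fallback / total))% ($fallback of $total)"
else
        echo "no anonymous mTHP faults recorded yet"
fi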
>
> Hi Barry,
>
> Thanks for the comments!
>
> I haven't seen any activity on TAO on the mailing list recently. Do you
> know if there are any plans for it to be sent for upstream review?
> I have cc'ed Yu Zhao as well.
>
> >> This will also mean swapin of large folios from traditional swap
> >> isn't something we should proceed with?
> >>
> >> - Should we even support large folio swapin? You often have high swap
> >> activity when the system/cgroup is close to running out of memory; at
> >> that point, maybe the best way forward is to just swap in 4K pages and
> >> let khugepaged [2], [3] collapse them if the surrounding pages are
> >> swapped in as well.
> >
> > This approach might be suitable for non-interactive scenarios, such as
> > building a kernel within a memory control group (memcg) or running
> > other server applications. However, performing collapse in interactive
> > and power-sensitive scenarios would be unnecessary and could lead to
> > wasted power due to memory migration and unmap/map operations.
> >
> > However, it is quite challenging to automatically determine the type
> > of workloads the system is running. I feel we still need a global
> > control to decide whether to enable mTHP swap-in: not necessarily per
> > size, but at least at a global level. That said, there is evident
> > resistance to introducing additional controls to enable or disable
> > mTHP features.
> >
> > By the way, Usama, have you ever tried switching between mglru and the
> > traditional active/inactive LRU? My experience shows a significant
> > difference in swap thrashing: the active/inactive LRU exhibits much
> > less swap thrashing in my local kernel build tests.
> >
>
> I never tried with MGLRU enabled, so I am probably seeing the lowest
> amount of swap thrashing.

Are you sure, Usama? mglru is enabled by default, and I have to echo 0
to manually disable it.
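
For reference, this is a quick way to check and switch it (the paths are
documented in Documentation/admin-guide/mm/multigen_lru.rst):

# Reading 'enabled' prints a capability bitmask, e.g. 0x0007 when MGLRU
# is active and 0x0000 when the traditional active/inactive LRU is used.
cat /sys/kernel/mm/lru_gen/enabled
# Switch to the traditional active/inactive LRU:
echo 0 > /sys/kernel/mm/lru_gen/enabled
# Switch back to MGLRU, enabling all of its components:
echo y > /sys/kernel/mm/lru_gen/enabled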
>
> Thanks,
> Usama
>
> > the latest mm-unstable:
> >
> > *********** default mglru: ***********
> >
> > root@barry-desktop:/home/barry/develop/linux# ./build.sh
> > *** Executing round 1 ***
> > real    6m44.561s
> > user    46m53.274s
> > sys     3m48.585s
> > pswpin: 1286081
> > pswpout: 3147936
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 714580
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 286881
> > pgpgin: 17199072
> > pgpgout: 21493892
> > swpout_zero: 229163
> > swpin_zero: 84353
> >
> > ******** disable mglru ********
> >
> > root@barry-desktop:/home/barry/develop/linux# echo 0 > /sys/kernel/mm/lru_gen/enabled
> >
> > root@barry-desktop:/home/barry/develop/linux# ./build.sh
> > *** Executing round 1 ***
> > real    6m27.944s
> > user    46m41.832s
> > sys     3m30.635s
> > pswpin: 474036
> > pswpout: 1434853
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 331755
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 106333
> > pgpgin: 11763720
> > pgpgout: 14551524
> > swpout_zero: 145050
> > swpin_zero: 87981
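
A side note on the gap: in these runs mglru swapped in roughly 2.7x as
many pages and swapped out roughly 2.2x as many. The one-liners below
just reproduce the arithmetic from the numbers above:

echo "scale=2; 1286081/474036" | bc    # pswpin ratio:  ~2.71
echo "scale=2; 3147936/1434853" | bc   # pswpout ratio: ~2.19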
> > my build script:
> >
> > #!/bin/bash
> > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >
> > vmstat_path="/proc/vmstat"
> > thp_base_path="/sys/kernel/mm/transparent_hugepage"
> >
> > read_values() {
> >     pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
> >     pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
> >     pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
> >     pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
> >     swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
> >     swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
> >     swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
> >     swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
> >     swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
> >     swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
> >     swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
> >     swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
> >     echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero $swpin_zero"
> > }
> >
> > for ((i=1; i<=1; i++))
> > do
> >     echo
> >     echo "*** Executing round $i ***"
> >     make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
> >     echo 3 > /proc/sys/vm/drop_caches
> >
> >     # kernel build
> >     initial_values=($(read_values))
> >     time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
> >     final_values=($(read_values))
> >
> >     echo "pswpin: $((final_values[0] - initial_values[0]))"
> >     echo "pswpout: $((final_values[1] - initial_values[1]))"
> >     echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
> >     echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
> >     echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
> >     echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
> >     echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
> >     echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
> >     echo "pgpgin: $((final_values[8] - initial_values[8]))"
> >     echo "pgpgout: $((final_values[9] - initial_values[9]))"
> >     echo "swpout_zero: $((final_values[10] - initial_values[10]))"
> >     echo "swpin_zero: $((final_values[11] - initial_values[11]))"
> >     sync
> >     sleep 10
> > done
> >
> >>
> >> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@xxxxxxxxx/
> >> [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@xxxxxxxxxx/
> >> [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@xxxxxxx/
> >>
> >> Thanks,
> >> Usama

Thanks
Barry