Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin

Barry Song <21cnbao@xxxxxxxxx> · Fri, 10 Jan 2025 23:47:40 +1300

On Fri, Jan 10, 2025 at 11:40 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 10/01/2025 10:30, Barry Song wrote:
> > On Fri, Jan 10, 2025 at 11:26 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 10/01/2025 10:09, Barry Song wrote:
> >>> Hi Usama,
> >>>
> >>> Please include me in the discussion. I'll try to attend, at least remotely.
> >>>
> >>> On Fri, Jan 10, 2025 at 9:06 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>
> >>>> I would like to propose a session to discuss the work going on
> >>>> around large folio swapin, whether it's traditional swap or
> >>>> zswap or zram.
> >>>>
> >>>> Large folios have obvious advantages that have been discussed before
> >>>> like fewer page faults, batched PTE and rmap manipulation, reduced
> >>>> lru list, TLB coalescing (for arm64 and amd).
> >>>> However, swapping in large folios has its own drawbacks like higher
> >>>> swap thrashing.
> >>>> I had initially sent an RFC of zswapin of large folios in [1]
> >>>> but it causes a regression due to swap thrashing in kernel
> >>>> build time, which I am confident is happening with zram large
> >>>> folio swapin as well (which is merged in the kernel).
> >>>>
> >>>> Some of the points we could discuss in the session:
> >>>>
> >>>> - What is the right (preferably open source) benchmark to test for
> >>>> swapin of large folios? kernel build time in limited
> >>>> memory cgroup shows a regression, microbenchmarks show a massive
> >>>> improvement, maybe there are benchmarks where TLB misses are
> >>>> a big factor and show an improvement.
> >>>
> >>> My understanding is that it largely depends on the workload. In interactive
> >>> scenarios, such as on a phone, swap thrashing is not an issue because
> >>> there is minimal to no thrashing for the app occupying the screen
> >>> (foreground). In such cases, swap bandwidth becomes the most critical factor
> >>> in improving app switching speed, especially when multiple applications
> >>> are switching between background and foreground states.
> >>>
> >>>>
> >>>> - We could have something like
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> >>>> to enable/disable swapin, but these knobs are going to be difficult to
> >>>> tune, might have different optimum values based on workloads, and are
> >>>> likely to be left at their default values. Is there some dynamic way to
> >>>> decide when to swapin large folios and when to fall back to smaller folios?
> >>>> The swapin_readahead swapcache path, which only supports 4K folios atm,
> >>>> has a readahead window based on hits. However, readahead is a folio flag
> >>>> and not a page flag, so this method can't be used: once a large folio
> >>>> is swapped in, we won't get a fault, and subsequent hits on other
> >>>> pages of the large folio won't be recorded.
> >>>>
> >>>> - For zswap and zram, it might be that doing larger block compression/
> >>>> decompression might offset the regression from swap thrashing, but it
> >>>> brings about its own issues. For example, once a large folio is swapped
> >>>> out, it could fail to swapin as a large folio and fall back
> >>>> to 4K, resulting in redundant decompressions.
> >>>
> >>> That's correct. My current workaround involves swapping four small folios,
> >>> and zsmalloc will compress and decompress in chunks of four pages,
> >>> regardless of the actual size of the mTHP. The improvement in compression
> >>> ratio and speed becomes less significant after exceeding four pages, even
> >>> though there is still some increase.
> >>>
> >>> Our recent experiments on phones also show that enabling direct reclamation
> >>> for do_swap_page() to allocate order-2 mTHP results in a 0% allocation
> >>> failure rate; this probably removes the need for falling back to 4 small
> >>> folios. (Note that our experiments include Yu's TAO, which Android GKI has
> >>> already merged. However, since 2 is less than
> >>> PAGE_ALLOC_COSTLY_ORDER, we might achieve similar results even
> >>> without Yu's TAO, although I have not confirmed this.)
> >>>
> >>
> >> Hi Barry,
> >>
> >> Thanks for the comments!
> >>
> >> I haven't seen any activity on TAO on the mailing list recently. Do you know
> >> if there are any plans for it to be sent for upstream review?
> >> Have cc-ed Yu Zhao as well.
> >>
> >>
> >>>> This will also mean swapin of large folios from traditional swap
> >>>> isn't something we should proceed with?
> >>>>
> >>>> - Should we even support large folio swapin? You often have high swap
> >>>> activity when the system/cgroup is close to running out of memory. At this
> >>>> point, maybe the best way forward is to just swapin 4K pages and let
> >>>> khugepaged [2], [3] collapse them if the surrounding pages are swapped in
> >>>> as well.
> >>>
> >>> This approach might be suitable for non-interactive scenarios, such as building
> >>> a kernel within a memory control group (memcg) or running other server
> >>> applications. However, performing collapse in interactive and power-sensitive
> >>> scenarios would be unnecessary and could lead to wasted power due to
> >>> memory migration and unmap/map operations.
> >>>
> >>> However, it is quite challenging to automatically determine the type of
> >>> workloads the system is running. I feel we still need a global control to
> >>> decide whether to enable mTHP swap-in, not necessarily per size, but at
> >>> least at a global level. That said, there is evident resistance to
> >>> introducing additional controls to enable or disable mTHP features.
> >>>
> >>> By the way, Usama, have you ever tried switching between mglru and the
> >>> traditional active/inactive LRU? My experience shows a significant
> >>> difference in swap thrashing: active/inactive LRU exhibits much less swap
> >>> thrashing in my local kernel build tests.
> >>>
> >>
> >> I never tried with MGLRU enabled, so I am probably seeing the lowest amount of
> >> swap-thrashing.
> >
> > Are you sure, Usama, since mglru is enabled by default? I have to echo 0
> > to manually disable it.
> >
>
> Yes, I don't have CONFIG_LRU_GEN set in my defconfig. I don't think it is set
> by default either? At least on x86.
>
> $ make defconfig
> $ grep LRU_GEN .config
> # CONFIG_LRU_GEN is not set

Okay, it’s likely because I’m using the Ubuntu distribution for x86 and Android
GKI for arm64, where mglru is enabled by default in both cases. But regardless,
I’d appreciate it if you could enable it and check if you observe the same
phenomenon as I did :-)
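
For anyone reproducing this, a minimal sketch for checking whether mglru is
active before a run. It assumes the kernel was built with CONFIG_LRU_GEN and
that /sys/kernel/mm/lru_gen/enabled reads back a hex bitmask (for example
0x0007) whose lowest bit is the main switch, as the multi-gen LRU admin guide
describes. The helper name lru_gen_state is made up for illustration.

```shell
#!/bin/bash
# Print "enabled", "disabled" or "unavailable" for the MGLRU main switch.
# Assumption: the file holds a hex bitmask and bit 0 is the main switch.
# lru_gen_state is a hypothetical helper name, not an existing tool.
lru_gen_state() {
    lgs_path="${1:-/sys/kernel/mm/lru_gen/enabled}"
    if [ ! -r "$lgs_path" ]; then
        # The file does not exist when CONFIG_LRU_GEN is not set.
        echo "unavailable"
        return 0
    fi
    lgs_mask=$(cat "$lgs_path")
    if [ $(( $lgs_mask & 1 )) -ne 0 ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}
```

Calling lru_gen_state with no argument reads the real sysfs file; passing a
path makes it easy to try against a scratch file. The path is the same one
written to in the "disable mglru" run quoted further down.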

>
> Thanks,
> Usama
>
> >>
> >> Thanks,
> >> Usama
> >>
> >>> Results on the latest mm-unstable:
> >>>
> >>> *********** default mglru:   ***********
> >>>
> >>> root@barry-desktop:/home/barry/develop/linux# ./build.sh
> >>> *** Executing round 1 ***
> >>> real 6m44.561s
> >>> user 46m53.274s
> >>> sys 3m48.585s
> >>> pswpin: 1286081
> >>> pswpout: 3147936
> >>> 64kB-swpout: 0
> >>> 32kB-swpout: 0
> >>> 16kB-swpout: 714580
> >>> 64kB-swpin: 0
> >>> 32kB-swpin: 0
> >>> 16kB-swpin: 286881
> >>> pgpgin: 17199072
> >>> pgpgout: 21493892
> >>> swpout_zero: 229163
> >>> swpin_zero: 84353
> >>>
> >>> ******** disable mglru ********
> >>>
> >>> root@barry-desktop:/home/barry/develop/linux# echo 0 > /sys/kernel/mm/lru_gen/enabled
> >>>
> >>> root@barry-desktop:/home/barry/develop/linux# ./build.sh
> >>> *** Executing round 1 ***
> >>> real 6m27.944s
> >>> user 46m41.832s
> >>> sys 3m30.635s
> >>> pswpin: 474036
> >>> pswpout: 1434853
> >>> 64kB-swpout: 0
> >>> 32kB-swpout: 0
> >>> 16kB-swpout: 331755
> >>> 64kB-swpin: 0
> >>> 32kB-swpin: 0
> >>> 16kB-swpin: 106333
> >>> pgpgin: 11763720
> >>> pgpgout: 14551524
> >>> swpout_zero: 145050
> >>> swpin_zero: 87981
> >>>
> >>> my build script:
> >>>
> >>> #!/bin/bash
> >>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>
> >>> vmstat_path="/proc/vmstat"
> >>> thp_base_path="/sys/kernel/mm/transparent_hugepage"
> >>>
> >>> read_values() {
> >>>     pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
> >>>     pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
> >>>     pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
> >>>     pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
> >>>     swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
> >>>     swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
> >>>     swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
> >>>     swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
> >>>     swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
> >>>     swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
> >>>     swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
> >>>     swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
> >>>     echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k" \
> >>>          "$swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero $swpin_zero"
> >>> }
> >>>
> >>> for ((i=1; i<=1; i++))
> >>> do
> >>>   echo
> >>>   echo "*** Executing round $i ***"
> >>>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
> >>>   echo 3 > /proc/sys/vm/drop_caches
> >>>
> >>>   #kernel build
> >>>   initial_values=($(read_values))
> >>>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
> >>>   final_values=($(read_values))
> >>>
> >>>   echo "pswpin: $((final_values[0] - initial_values[0]))"
> >>>   echo "pswpout: $((final_values[1] - initial_values[1]))"
> >>>   echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
> >>>   echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
> >>>   echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
> >>>   echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
> >>>   echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
> >>>   echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
> >>>   echo "pgpgin: $((final_values[8] - initial_values[8]))"
> >>>   echo "pgpgout: $((final_values[9] - initial_values[9]))"
> >>>   echo "swpout_zero: $((final_values[10] - initial_values[10]))"
> >>>   echo "swpin_zero: $((final_values[11] - initial_values[11]))"
> >>>   sync
> >>>   sleep 10
> >>> done
> >>>
> >>>>
> >>>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@xxxxxxxxx/
> >>>> [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@xxxxxxxxxx/
> >>>> [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@xxxxxxx/
> >>>>
> >>>> Thanks,
> >>>> Usama
> >>>
> >
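
To put the two runs quoted above side by side, here is a small sketch that
computes how much lower a counter is in the second run (mglru disabled) than
in the first (default mglru). The helper name pct_drop is hypothetical; the
numbers are copied from the output quoted in this thread.

```shell
#!/bin/bash
# Percentage by which "after" is lower than "before", to one decimal place.
# pct_drop is a hypothetical helper name used only for this illustration.
pct_drop() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

# Counters from the quoted runs: default mglru first, mglru disabled second.
echo "pswpin drop:  $(pct_drop 1286081 474036)%"
echo "pswpout drop: $(pct_drop 3147936 1434853)%"
```

On these figures the active/inactive LRU run shows roughly 63% fewer pswpin
and 54% fewer pswpout events than the default mglru run, from a single round
of each.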

Thanks
Barry