On 10/01/2025 10:30, Barry Song wrote:
> On Fri, Jan 10, 2025 at 11:26 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>
>>
>>
>> On 10/01/2025 10:09, Barry Song wrote:
>>> Hi Usama,
>>>
>>> Please include me in the discussion. I'll try to attend, at least remotely.
>>>
>>> On Fri, Jan 10, 2025 at 9:06 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>
>>>> I would like to propose a session to discuss the work going on
>>>> around large folio swapin, whether it's traditional swap, zswap,
>>>> or zram.
>>>>
>>>> Large folios have obvious advantages that have been discussed before,
>>>> like fewer page faults, batched PTE and rmap manipulation, reduced
>>>> LRU list overhead, and TLB coalescing (on arm64 and AMD).
>>>> However, swapping in large folios has its own drawbacks, like higher
>>>> swap thrashing.
>>>> I had initially sent an RFC for zswapin of large folios in [1],
>>>> but it causes a regression in kernel build time due to swap
>>>> thrashing, which I am confident is happening with zram large
>>>> folio swapin as well (which is merged in the kernel).
>>>>
>>>> Some of the points we could discuss in the session:
>>>>
>>>> - What is the right (preferably open source) benchmark to test for
>>>>   swapin of large folios? Kernel build time in a limited-memory
>>>>   cgroup shows a regression, while microbenchmarks show a massive
>>>>   improvement; maybe there are benchmarks where TLB misses are
>>>>   a big factor and show an improvement.
>>>
>>> My understanding is that it largely depends on the workload. In interactive
>>> scenarios, such as on a phone, swap thrashing is not an issue because
>>> there is minimal to no thrashing for the app occupying the screen
>>> (foreground). In such cases, swap bandwidth becomes the most critical factor
>>> in improving app switching speed, especially when multiple applications
>>> are switching between background and foreground states.
>>>
>>>> - We could have something like
>>>>   /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
>>>>   to enable/disable swapin, but it is going to be difficult to tune,
>>>>   might have different optimum values based on workloads, and is likely
>>>>   to be left at its default value. Is there some dynamic way to decide
>>>>   when to swap in large folios and when to fall back to smaller folios?
>>>>   The swapin_readahead swapcache path, which only supports 4K folios at
>>>>   the moment, has a readahead window based on hits. However, readahead
>>>>   is a folio flag and not a page flag, so this method can't be used:
>>>>   once a large folio is swapped in, we won't get a fault, and subsequent
>>>>   hits on other pages of the large folio won't be recorded.
>>>>
>>>> - For zswap and zram, it might be that doing larger block compression/
>>>>   decompression could offset the regression from swap thrashing, but it
>>>>   brings its own issues. For example, once a large folio is swapped
>>>>   out, it could fail to swap in as a large folio and fall back
>>>>   to 4K, resulting in redundant decompressions.
>>>
>>> That's correct. My current workaround involves swapping four small folios,
>>> and zsmalloc will compress and decompress in chunks of four pages,
>>> regardless of the actual size of the mTHP. The improvement in compression
>>> ratio and speed becomes less significant beyond four pages, even
>>> though there is still some increase.
>>>
>>> Our recent experiments on phones also show that enabling direct reclamation
>>> for do_swap_page() to allocate order-2 mTHPs results in a 0% allocation
>>> failure rate; this probably removes the need to fall back to 4 small
>>> folios. (Note that our experiments include Yu's TAO; Android GKI has
>>> already merged it. However, since 2 is less than
>>> PAGE_ALLOC_COSTLY_ORDER, we might achieve similar results even
>>> without Yu's TAO, although I have not confirmed this.)
>>>
>>
>> Hi Barry,
>>
>> Thanks for the comments!
>>
>> I haven't seen any activity on TAO on the mailing list recently. Do you know
>> if there are any plans for it to be sent for upstream review?
>> I have cc'ed Yu Zhao as well.
>>
>>
>>>> This will also mean swapin of large folios from traditional swap
>>>> isn't something we should proceed with?
>>>>
>>>> - Should we even support large folio swapin? You often have high swap
>>>>   activity when the system/cgroup is close to running out of memory; at
>>>>   this point, maybe the best way forward is to just swap in 4K pages and
>>>>   let khugepaged [2], [3] collapse them if the surrounding pages are
>>>>   swapped in as well.
>>>
>>> This approach might be suitable for non-interactive scenarios, such as building
>>> a kernel within a memory control group (memcg) or running other server
>>> applications. However, performing collapse in interactive and power-sensitive
>>> scenarios would be unnecessary and could lead to wasted power due to
>>> memory migration and unmap/map operations.
>>>
>>> However, it is quite challenging to automatically determine the type
>>> of workloads the system is running. I feel we still need a global control
>>> to decide whether to enable mTHP swap-in, not necessarily per size, but
>>> at least at a global level. That said, there is evident resistance to
>>> introducing additional controls to enable or disable mTHP features.
>>>
>>> By the way, Usama, have you ever tried switching between MGLRU and the
>>> traditional active/inactive LRU? My experience shows a significant
>>> difference in swap thrashing: the active/inactive LRU exhibits much less
>>> swap thrashing in my local kernel build tests.
>>>
>>
>> I never tried with MGLRU enabled, so I am probably seeing the lowest amount of
>> swap thrashing.
>
> Are you sure, Usama, since MGLRU is enabled by default? I have to echo
> 0 to manually disable it.
>

Yes, I don't have CONFIG_LRU_GEN set in my defconfig. I don't think it is
set by default either, at least on x86:
$ make defconfig
$ grep LRU_GEN .config
# CONFIG_LRU_GEN is not set

Thanks,
Usama

>>
>> Thanks,
>> Usama
>>
>>> On the latest mm-unstable:
>>>
>>> *********** default mglru: ***********
>>>
>>> root@barry-desktop:/home/barry/develop/linux# ./build.sh
>>> *** Executing round 1 ***
>>> real    6m44.561s
>>> user    46m53.274s
>>> sys     3m48.585s
>>> pswpin: 1286081
>>> pswpout: 3147936
>>> 64kB-swpout: 0
>>> 32kB-swpout: 0
>>> 16kB-swpout: 714580
>>> 64kB-swpin: 0
>>> 32kB-swpin: 0
>>> 16kB-swpin: 286881
>>> pgpgin: 17199072
>>> pgpgout: 21493892
>>> swpout_zero: 229163
>>> swpin_zero: 84353
>>>
>>> ******** disable mglru ********
>>>
>>> root@barry-desktop:/home/barry/develop/linux# echo 0 > /sys/kernel/mm/lru_gen/enabled
>>>
>>> root@barry-desktop:/home/barry/develop/linux# ./build.sh
>>> *** Executing round 1 ***
>>> real    6m27.944s
>>> user    46m41.832s
>>> sys     3m30.635s
>>> pswpin: 474036
>>> pswpout: 1434853
>>> 64kB-swpout: 0
>>> 32kB-swpout: 0
>>> 16kB-swpout: 331755
>>> 64kB-swpin: 0
>>> 32kB-swpin: 0
>>> 16kB-swpin: 106333
>>> pgpgin: 11763720
>>> pgpgout: 14551524
>>> swpout_zero: 145050
>>> swpin_zero: 87981
>>>
>>> my build script:
>>>
>>> #!/bin/bash
>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>
>>> vmstat_path="/proc/vmstat"
>>> thp_base_path="/sys/kernel/mm/transparent_hugepage"
>>>
>>> read_values() {
>>>     pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
>>>     pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
>>>     pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
>>>     pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
>>>     swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
>>>     swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
>>>     swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
>>>     swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
>>>     swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
>>>     swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
>>>     swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
>>>     swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
>>>     echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero $swpin_zero"
>>> }
>>>
>>> for ((i=1; i<=1; i++))
>>> do
>>>     echo
>>>     echo "*** Executing round $i ***"
>>>     make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>>>     echo 3 > /proc/sys/vm/drop_caches
>>>
>>>     # kernel build
>>>     initial_values=($(read_values))
>>>     time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
>>>     final_values=($(read_values))
>>>
>>>     echo "pswpin: $((final_values[0] - initial_values[0]))"
>>>     echo "pswpout: $((final_values[1] - initial_values[1]))"
>>>     echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
>>>     echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
>>>     echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
>>>     echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
>>>     echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
>>>     echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
>>>     echo "pgpgin: $((final_values[8] - initial_values[8]))"
>>>     echo "pgpgout: $((final_values[9] - initial_values[9]))"
>>>     echo "swpout_zero: $((final_values[10] - initial_values[10]))"
>>>     echo "swpin_zero: $((final_values[11] - initial_values[11]))"
>>>     sync
>>>     sleep 10
>>> done
>>>
>>>>
>>>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@xxxxxxxxx/
>>>> [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@xxxxxxxxxx/
>>>> [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@xxxxxxx/
>>>>
>>>> Thanks,
>>>> Usama
>>>
>
> Thanks
> Barry
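
P.S. The CONFIG_LRU_GEN check can also be done on a running kernel rather
than by grepping .config: the lru_gen sysfs directory only exists when
CONFIG_LRU_GEN is built in. A minimal sketch (the helper name is just for
illustration; when the file exists, its value is a hex feature bitmask,
e.g. 0x0007 when enabled and 0x0000 after `echo 0` as in Barry's run above):

```shell
#!/bin/bash
# Report whether MGLRU is built in and what its runtime state is.
# /sys/kernel/mm/lru_gen/enabled is only present with CONFIG_LRU_GEN=y.
mglru_state() {
    local f=/sys/kernel/mm/lru_gen/enabled
    if [ -r "$f" ]; then
        echo "MGLRU built in, state: $(cat "$f")"
        # To disable at runtime (root required), as in the comparison above:
        #   echo 0 > "$f"
    else
        echo "CONFIG_LRU_GEN not set: classic active/inactive LRU"
    fi
}

mglru_state
```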