Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin

On Thu, Jan 9, 2025 at 12:06 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
> I would like to propose a session to discuss the work going on
> around large folio swapin, whether for traditional swap, zswap,
> or zram.
>
> Large folios have obvious advantages that have been discussed before,
> like fewer page faults, batched PTE and rmap manipulation, shorter
> LRU lists, and TLB coalescing (on arm64 and AMD).
> However, swapping in large folios has its own drawbacks, such as
> increased swap thrashing.
> I initially sent an RFC for zswapin of large folios in [1],
> but it caused a kernel build time regression due to swap
> thrashing, which I am confident is happening with zram large
> folio swapin as well (which is already merged in the kernel).

I am obviously interested in this discussion, but unfortunately I
won't be able to make it this year. I will try to attend remotely
if possible, though!

>
> Some of the points we could discuss in the session:
>
> - What is the right (preferably open source) benchmark to test
> swapin of large folios? Kernel build time in a limited
> memory cgroup shows a regression, while microbenchmarks show a massive
> improvement; maybe there are benchmarks where TLB misses are
> a big factor and would show an improvement.
>
> - We could have something like
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> to enable/disable swapin, but it's going to be difficult to tune: the
> optimum values might differ between workloads, and the knobs are likely
> to be left at their defaults. Is there some dynamic way to decide when
> to swap in large folios and when to fall back to smaller folios?
> The swapin_readahead swapcache path, which only supports 4K folios at
> the moment, sizes its readahead window based on hits; however, readahead
> is a folio flag and not a per-page flag, so this method can't be reused:
> once a large folio is swapped in, we won't get a fault, and subsequent
> hits on other pages of the large folio won't be recorded.
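For reference, the existing per-size THP knobs under /sys/kernel/mm/transparent_hugepage/ already take this per-hugepages-*kB shape, and a swapin_enabled knob would presumably be toggled the same way. Note that swapin_enabled is the *proposed* interface, not a knob in mainline, so this sketch writes to a scratch directory rather than the real sysfs tree:

```shell
# Sketch of toggling a hypothetical per-size "swapin_enabled" knob,
# mirroring the existing per-size "enabled" files. Since the knob does
# not exist in mainline, a scratch directory stands in for sysfs.
base=$(mktemp -d)
mkdir -p "$base/hugepages-64kB" "$base/hugepages-2048kB"

for dir in "$base"/hugepages-*kB; do
    # On a real system this would be, e.g.:
    #   echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/swapin_enabled
    echo always > "$dir/swapin_enabled"
done

cat "$base/hugepages-2048kB/swapin_enabled"
```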
>
> - For zswap and zram, it might be that doing larger block compression/
> decompression offsets the regression from swap thrashing, but it
> brings its own issues. For example, once a large folio is swapped
> out, it could fail to swap in as a large folio and fall back
> to 4K, resulting in redundant decompressions.
> Would this also mean that swapin of large folios from traditional swap
> isn't something we should proceed with?
>
> - Should we even support large folio swapin? You often have high swap
> activity when the system/cgroup is close to running out of memory; at that
> point, maybe the best way forward is to just swap in 4K pages and let
> khugepaged [2], [3] collapse them if the surrounding pages are swapped in
> as well.
>
> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@xxxxxxxxx/
> [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@xxxxxxxxxx/
> [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@xxxxxxx/
>
> Thanks,
> Usama




