I would like to propose a session to discuss the ongoing work around large folio swapin, whether it's traditional swap, zswap or zram. Large folios have well-known advantages that have been discussed before: fewer page faults, batched PTE and rmap manipulation, reduced LRU list operations, and TLB coalescing (on arm64 and AMD). However, swapping in large folios has its own drawbacks, such as increased swap thrashing. I had initially sent an RFC for zswapin of large folios in [1], but it caused a regression in kernel build time due to swap thrashing, and I am confident the same is happening with zram large folio swapin (which is already merged in the kernel).

Some of the points we could discuss in the session:

- What is the right (preferably open source) benchmark to test swapin of large folios? Kernel build time in a memory-limited cgroup shows a regression, while microbenchmarks show a massive improvement; maybe there are benchmarks where TLB misses are a big factor and which would show an improvement.

- We could add something like /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled to enable/disable large folio swapin, but it would be difficult to tune, might have different optimum values depending on the workload, and would likely be left at its default value. Is there some dynamic way to decide when to swap in large folios and when to fall back to smaller folios? The swapin_readahead swapcache path, which only supports 4K folios at the moment, has a readahead window based on hits; however, readahead is a folio flag and not a page flag, so this method can't be reused: once a large folio is swapped in, we won't get a fault on it again, so subsequent hits on the other pages of the large folio won't be recorded.

- For zswap and zram, it might be that compressing/decompressing larger blocks offsets the regression from swap thrashing, but that brings its own issues. For example, once a large folio is swapped out, swapping it back in as a large folio could fail and fall back to 4K pages, resulting in redundant decompressions.
This would also mean that large folio swapin from traditional swap isn't something we should proceed with?

- Should we even support large folio swapin at all? High swap activity usually occurs when the system/cgroup is close to running out of memory; at that point, maybe the best way forward is to just swap in 4K pages and let khugepaged [2], [3] collapse them if the surrounding pages are swapped in as well.

[1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@xxxxxxxxx/
[2] https://lore.kernel.org/all/20250108233128.14484-1-npache@xxxxxxxxxx/
[3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@xxxxxxx/

Thanks,
Usama