On Fri, Jan 10, 2025 at 3:08 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
> I would like to propose a session to discuss the work going on
> around large folio swapin, whether it's traditional swap, zswap,
> or zram.

I'm interested! Count me in the discussion :)

>
> Large folios have obvious advantages that have been discussed before,
> like fewer page faults, batched PTE and rmap manipulation, shorter
> LRU lists, and TLB coalescing (on arm64 and AMD).
> However, swapping in large folios has its own drawbacks, like higher
> swap thrashing.
> I had initially sent an RFC for zswapin of large folios in [1],
> but it causes a regression in kernel build time due to swap
> thrashing, which I am confident is happening with zram large
> folio swapin as well (which is merged in the kernel).
>
> Some of the points we could discuss in the session:
>
> - What is the right (preferably open source) benchmark to test
> swapin of large folios? Kernel build time in a limited-memory
> cgroup shows a regression, while microbenchmarks show a massive
> improvement; maybe there are benchmarks where TLB misses are
> a big factor and which show an improvement.
>
> - We could have something like
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> to enable/disable swapin, but these are going to be difficult to tune,
> might have different optimum values based on workloads, and are likely
> to be

Might even be different across memory regions.

> left at their default values. Is there some dynamic way to decide when
> to swap in large folios and when to fall back to smaller folios?
> The swapin_readahead swapcache path, which only supports 4K folios atm,
> has a readahead window based on hits; however, readahead is a folio
> flag and not a page flag, so this method can't be used: once a large
> folio is swapped in, we won't get a fault, and subsequent hits on
> other pages of the large folio won't be recorded.

Is this beneficial/useful enough to make it into a page flag? Can we
push this to the swap layer, i.e. record the hit information on a
per-swap-entry basis instead? The space is a bit tight, but we're
already in talks about the new swap abstraction layer. If we go the
dynamic route, we can squeeze this kind of information into the
dynamically allocated per-swap-entry metadata structure (swap
descriptor?).

However, the swap entry can go away after a swapin (see
should_try_to_free_swap()), so that might be busted :)
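Ignoring that lifetime problem for a second, here is a minimal
user-space sketch of the per-swap-entry idea. Everything in it is
hypothetical (struct swap_desc and both helpers are made-up names, not
existing kernel API); it only illustrates where the hit counter would
live and how a swapin fault could consult it:

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch only, not kernel code: keep the readahead-style hit
 * counter in hypothetical per-swap-entry metadata (the "swap
 * descriptor" above) rather than in a folio flag, so hits on
 * individual pages survive a large folio swapin.
 */
struct swap_desc {
	unsigned char hits;	/* saturating per-entry hit counter */
	/* compression block info, backend cookie, ... */
};

/* Record a hit when a fault or swap cache lookup touches an entry. */
static void swap_desc_record_hit(struct swap_desc *desc)
{
	if (desc->hits < 255)
		desc->hits++;
}

/*
 * Attempt a large folio swapin only if at least @pct percent of the
 * entries covering the would-be folio have seen hits; otherwise
 * fall back to a 4K swapin.
 */
static bool swapin_should_go_large(struct swap_desc *descs,
				   unsigned int nr, unsigned int pct)
{
	unsigned int i, hot = 0;

	for (i = 0; i < nr; i++)
		if (descs[i].hits)
			hot++;
	return hot * 100 >= nr * pct;
}

int main(void)
{
	struct swap_desc descs[16] = { 0 };

	/* pretend 12 of the 16 subpage entries saw hits */
	for (int i = 0; i < 12; i++)
		swap_desc_record_hit(&descs[i]);
	printf("go large: %d\n", swapin_should_go_large(descs, 16, 75));
	return 0;
}

The should_try_to_free_swap() caveat still applies: if the entry (and
with it the descriptor) is freed right after swapin, the history is
lost, so this only helps while entries stick around.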
>
> - For zswap and zram, it might be that larger block compression/
> decompression offsets the regression from swap thrashing, but it
> brings its own issues. For example, once a large folio is swapped
> out, it could fail to swap in as a large folio and fall back to 4K,
> resulting in redundant decompressions.
> This would also mean swapin of large folios from traditional swap
> isn't something we should proceed with?

Yeah, the cost/benefit analysis differs between backends. I wonder if
a one-size-fits-all, backend-agnostic policy could ever work - maybe
we need some backend-driven algorithm, or some sort of hinting
mechanism? This would make the logic uglier though.

We've been here before with HDD and SSD swap, except we don't really
care about the former, so we can prioritize optimizing for SSD swap
(in fact, it looks like we're removing the HDD portion of the swap
allocator). In this case, however, zswap, zram, and SSD swap are all
valid options, with different characteristics that can make the
optimal decision differ :)

If we're going the block (de)compression route, there is also this
pesky block size question. For instance, do we want to store the
entire 2MB in a single block? That would mean we need to decompress
the entire 2MB block at load time (some toy read-amplification
numbers below). It might be more straightforward in the mTHP world,
but we do need to consider 2MB THP users too.

Finally, the calculus might change once large folio allocation becomes
more reliable. Perhaps we can wait until Johannes and Yu make this
work?
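On the block size question, a self-contained toy calculation (plain
user-space C, nothing kernel-specific, numbers purely illustrative) of
how much data a single 4K fault has to decompress after the
fallback-to-4K case above, for various compression block sizes:

#include <stdio.h>

/*
 * Toy read-amplification calculation: a 2MB folio is stored in
 * compressed blocks of a given size, then (after falling back to
 * 4K swapin) faulted back one 4K page at a time. Each fault has
 * to decompress one whole block.
 */
int main(void)
{
	const unsigned long folio = 2UL << 20;	/* 2MB THP */
	const unsigned long page = 4UL << 10;	/* 4K fault */

	for (unsigned long block = page; block <= folio; block <<= 1)
		printf("block %5lu kB -> decompress %5lu kB per 4K fault (%3lux)\n",
		       block >> 10, block >> 10, block / page);
	return 0;
}

With a single 2MB block, every 4K fault decompresses 512x more data
than it returns; with, say, 64K blocks that drops to 16x, at the cost
of a worse compression ratio and more per-block metadata.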