On Fri, Jan 10, 2025 at 5:29 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Fri, Jan 10, 2025 at 3:08 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >
> > I would like to propose a session to discuss the work going on
> > around large folio swapin, whether it's traditional swap, zswap
> > or zram.
>
> I'm interested! Count me in the discussion :)
>
> >
> > Large folios have obvious advantages that have been discussed before,
> > like fewer page faults, batched PTE and rmap manipulation, shorter
> > LRU lists, and TLB coalescing (for arm64 and amd).
> > However, swapping in large folios has its own drawbacks, like higher
> > swap thrashing.
> > I had initially sent an RFC for zswapin of large folios in [1],
> > but it causes a regression in kernel build time due to swap
> > thrashing, which I am confident is happening with zram large
> > folio swapin as well (which is merged in the kernel).
> >
> > Some of the points we could discuss in the session:
> >
> > - What is the right (preferably open source) benchmark to test for
> > swapin of large folios? Kernel build time in a limited
> > memory cgroup shows a regression, microbenchmarks show a massive
> > improvement, and maybe there are benchmarks where TLB misses are
> > a big factor and show an improvement.
> >
> > - We could have something like
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> > to enable/disable swapin, but they are going to be difficult to tune, might
> > have different optimum values based on workloads, and are likely to be
>
> Might even be different across memory regions.
>
> > left at their default values. Is there some dynamic way to decide when
> > to swap in large folios and when to fall back to smaller folios?
> > The swapin_readahead swapcache path, which only supports 4K folios atm, has a
> > readahead window based on hits; however, readahead is a folio flag and
> > not a page flag, so this method can't be used: once a large folio
> > is swapped in, we won't get a fault, and subsequent hits on other
> > pages of the large folio won't be recorded.
>
> Is this beneficial/useful enough to make it into a page flag?
>
> Can we push this to the swap layer, i.e. record the hit information on
> a per-swap-entry basis instead? The space is a bit tight, but we're
> already in the talk for the new swap abstraction layer. If we go the
> dynamic route, we can squeeze this kind of information into the
> dynamically allocated per-swap-entry metadata structure (swap
> descriptor?).
>
> However, the swap entry can go away after a swapin (see
> should_try_to_free_swap()), so that might be busted :)
>
> >
> > - For zswap and zram, it might be that doing larger block compression/
> > decompression offsets the regression from swap thrashing, but it
> > brings about its own issues. For example, once a large folio is swapped
> > out, it could fail to swap in as a large folio and fall back
> > to 4K, resulting in redundant decompressions.
> > This will also mean swapin of large folios from traditional swap
> > isn't something we should proceed with?
>
> Yeah, the cost/benefit analysis differs between backends. I wonder if a
> one-size-fits-all, backend-agnostic policy could ever work - maybe we
> need some backend-driven algorithm, or some sort of hinting mechanism?
>
> This would make the logic uglier though. We've been here before with
> HDD and SSD swap, except we don't really care about the former, so we
> can prioritize optimizing for SSD swap (in fact it looks like we're
> removing the HDD portion of the swap allocator). In this case, however,
> zswap, zram, and SSD swap are all valid options, with different
> characteristics that can make the optimal decision differ :)
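
Going back to the per-swap-entry hit idea above, here is a purely
hypothetical sketch of what recording hits in a dynamically allocated
swap descriptor and using them to pick a swapin order could look like.
None of these names (swap_desc, ra_hits, swapin_pick_order) exist in
the kernel today, and the 75% threshold is an arbitrary placeholder:

/*
 * Hypothetical sketch only -- swap_desc, ra_hits and swapin_pick_order
 * are made-up names, not existing kernel structures or functions.
 */
struct swap_desc {
	unsigned char ra_hits;		/* per-entry hit counter */
	/* ... backend handle, compressed-object pointer, etc. ... */
};

/* Bump the counter whenever a lookup of this swap entry hits. */
static inline void swap_desc_record_hit(struct swap_desc *desc)
{
	if (desc->ra_hits < 255)
		desc->ra_hits++;
}

/*
 * Pick a swapin order for a faulting entry: only go large when most of
 * the entries the large folio would cover have been hit before.
 * @descs is the run of descriptors the folio would span, @order the
 * largest order we are willing to try.
 */
static int swapin_pick_order(struct swap_desc *descs, int order)
{
	while (order > 0) {
		int nr = 1 << order, hot = 0;

		for (int i = 0; i < nr; i++)
			hot += descs[i].ra_hits > 0;

		if (hot * 4 >= nr * 3)	/* arbitrary: >= 75% previously hit */
			return order;
		order--;		/* otherwise try a smaller folio */
	}
	return 0;			/* fall back to a single 4K page */
}

As noted above, the swap entry (and with it the descriptor) can be freed
right after swapin via should_try_to_free_swap(), so the hit history
would need to live somewhere that survives the swapin for this to work.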
>
> If we're going the block (de)compression route, there is also this
> pesky block size question. For instance, do we want to store the
> entire 2MB in a single block? That would mean we need to decompress
> the entire 2MB block at load time. It might be more straightforward in
> the mTHP world, but we do need to consider 2MB THP users too.

I don't think we need to save the entire 2MB in a single block. Beyond
64KB we don't see much improvement in compression ratio or speed; the
most significant gain is between 4KB and 16KB. For example, for zstd:

File size: 182502912 bytes

Block size   Compression time   Decompression time   Compressed size   Compression ratio
4KB          0.967303 s         0.200064 s           66089193 bytes    36.21%
16KB         0.567167 s         0.152807 s           59159073 bytes    32.42%
32KB         0.543887 s         0.136602 s           57958701 bytes    31.76%
64KB         0.536979 s         0.127069 s           56700795 bytes    31.07%
128KB        0.540505 s         0.120685 s           55765775 bytes    30.56%
256KB        0.575515 s         0.125049 s           54203461 bytes    29.70%
512KB        0.571370 s         0.119609 s           53914422 bytes    29.54%
1024KB       0.556631 s         0.119475 s           53239893 bytes    29.17%
2048KB       0.539796 s         0.119751 s           52923234 bytes    29.00%

(A rough userspace sketch of this kind of per-block-size measurement is
appended at the end of this mail, for illustration only.)

To simplify things (and to avoid decompressing a large block for a
small swap-in), for a 2MB THP we actually save it as 2MB/16KB blocks in
zsmalloc, as shown in the RFC:
https://lore.kernel.org/linux-mm/20241121222521.83458-1-21cnbao@xxxxxxxxx/

>
> Finally, the calculus might change once large folio allocation becomes
> more reliable. Perhaps we can wait until Johannes and Yu make this
> work?
>
> >
> > - Should we even support large folio swapin? You often have high swap
> > activity when the system/cgroup is close to running out of memory; at this
> > point, maybe the best way forward is to just swap in 4K pages and let
> > khugepaged [2], [3] collapse them if the surrounding pages are swapped in
> > as well.
>
> Perhaps this is the easiest thing to do :)

Thanks
barry
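
Appended for illustration, as mentioned above: a rough userspace sketch
of the kind of per-block-size compression measurement being discussed
(compression side only). It is not the tool behind the numbers above; it
assumes libzstd is installed, and the input file path and block-size
list are arbitrary placeholders.

/*
 * Illustrative sketch: compress a file in fixed-size blocks with zstd
 * and report the total time and compressed size for each block size.
 * Build with: gcc sketch.c -lzstd
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zstd.h>

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
	/* placeholder input file and example block sizes */
	const char *path = argc > 1 ? argv[1] : "testfile.bin";
	const size_t blocks[] = { 4096, 16384, 65536, 2097152 };
	FILE *f = fopen(path, "rb");

	if (!f)
		return perror("fopen"), 1;

	fseek(f, 0, SEEK_END);
	long fsize = ftell(f);
	fseek(f, 0, SEEK_SET);

	char *src = malloc(fsize);
	if (!src || fread(src, 1, fsize, f) != (size_t)fsize)
		return fprintf(stderr, "read failed\n"), 1;
	fclose(f);

	for (size_t b = 0; b < sizeof(blocks) / sizeof(blocks[0]); b++) {
		size_t bs = blocks[b], total = 0;
		char *dst = malloc(ZSTD_compressBound(bs));
		double t0 = now_sec();

		/* compress the file block by block, one chunk at a time */
		for (long off = 0; off < fsize; off += bs) {
			size_t n = fsize - off < (long)bs ? fsize - off : bs;
			size_t c = ZSTD_compress(dst, ZSTD_compressBound(bs),
						 src + off, n, 3);

			if (ZSTD_isError(c))
				return fprintf(stderr, "zstd: %s\n",
					       ZSTD_getErrorName(c)), 1;
			total += c;
		}
		printf("%7zu-byte blocks: %.3fs, %zu -> %zu bytes (%.2f%%)\n",
		       bs, now_sec() - t0, (size_t)fsize, total,
		       100.0 * total / fsize);
		free(dst);
	}
	free(src);
	return 0;
}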