Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Tue, 28 Mar 2023 12:59:55 -0700

On Tue, Mar 28, 2023 at 7:14 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
> > On Tue, Mar 28, 2023 at 12:01 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> > > Yosry Ahmed <yosryahmed@xxxxxxxxxx> writes:
> > > > We also have to unnecessarily limit the size of zswap with the size of
> > > > this fake swapfile.
> > >
> > > I guess you need to limit the size of zswap anyway, because you need to
> > > decide when to start writeback or move pages to the lower tiers.
> >
> > zswap has a knob to limit its size, but based on the actual memory
> > usage of zswap (i.e., the size of compressed pages). There is ongoing
> > work as well to autotune this if I remember correctly. Having to deal
> > with both the limit on compressed memory and the limit on the
> > uncompressed size of swapped pages is cumbersome. Again, we already
> > have this behavior today, but the initial swap_desc proposal aimed to
> > avoid it.
>
> Right.
>
> The optimal size of the zswap pool on top of a swapfile depends on the
> size and compressibility of the warm set of the workload: data that's
> too cold for regular memory yet too hot for swap. This is obviously
> highly dynamic, and even varies over time within individual jobs.
>
> With this proposal, we'd have to provision a static swap map for the
> highest expected offloading rate and compression ratio on every host
> of a shared pool. On 256G machines that would put the fixed overhead
> at a couple of hundred MB if I counted right.
>
> Not the end of the world I guess. And I agree it would make for
> simpler initial patches. OTOH, it would add more quirks to the swap
> code instead of cleaning it up. And given how common compressed memory
> setups are nowadays, it still feels like it's trading off too far in
> favor of regular swap setups at the expense of compression.

Right, I don't like adding more quirks to the swap code. I guess for
Android and ChromeOS, even though they are using compressed memory, it
is zram, not zswap, so any extra overhead from swap_descs for normal
swap setups would also affect Android -- so that's something to think
about.
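
For a rough sense of the fixed overhead quoted above, here is a
back-of-the-envelope sketch in C. Everything in it is an assumption
made for illustration: the helper name is made up, swap is provisioned
to cover as much uncompressed data as the machine has RAM (256G), pages
are 4K, and each swap slot costs about 3 bytes of static metadata
(1 byte of swap map plus a 2-byte cgroup id). With those inputs the
result lands in the "couple of hundred MB" range.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): estimate the static
 * per-slot metadata needed for a swap device that can hold
 * `swap_bytes` of uncompressed pages.
 *
 * bytes_per_slot is an assumed figure, e.g. 1 byte of swap map
 * plus a 2-byte cgroup id per slot.
 */
static uint64_t swap_static_overhead_bytes(uint64_t swap_bytes,
                                           uint64_t page_size,
                                           uint64_t bytes_per_slot)
{
    assert(page_size != 0);

    /* One slot per page of uncompressed data the device can hold. */
    uint64_t slots = swap_bytes / page_size;

    return slots * bytes_per_slot;
}
```

Calling swap_static_overhead_bytes(256ULL << 30, 4096, 3) gives 64M
slots times 3 bytes, i.e. 192MB. Provisioning for a higher offloading
rate or compression ratio scales that number linearly.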

>
> So it wouldn't be my first preference. But it sounds workable.

If we settle on this as a first step, perhaps to avoid any ABI changes
we can have the kernel create a virtual swap device for zswap if it is
enabled, without userspace interfering or having to do swapon on a
sparse swapfile like we do today with ghost swapfiles at Google. We
can then implement indirection logic that only supports moving pages
between swap devices -- and perhaps initially restrict it to support
only the virtual zswap swap device as a top tier.
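
To make the indirection idea a bit more concrete, here is a minimal
userspace sketch in C, not kernel code. All names (swap_tier,
swap_slot, swap_indirection and the helpers) are hypothetical and only
illustrate the shape of it: a stable virtual slot id maps to whichever
swap device currently holds the page, and the only supported move is
from the virtual zswap device (top tier) down to a backing device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tiers: the kernel-created zswap device sits on top. */
enum swap_tier {
    TIER_ZSWAP_VIRTUAL = 0,
    TIER_BACKING = 1,
};

/* Where a swapped-out page currently lives. */
struct swap_slot {
    enum swap_tier tier;  /* device that holds the page right now */
    uint64_t offset;      /* slot offset within that device */
};

#define NR_VIRT_SLOTS 8   /* tiny table, for illustration only */

/* Indexed by a stable id; the id never changes when the page moves. */
struct swap_indirection {
    struct swap_slot slots[NR_VIRT_SLOTS];
};

/* Start with every slot resident in the virtual zswap device. */
static void indirection_init(struct swap_indirection *t)
{
    for (size_t i = 0; i < NR_VIRT_SLOTS; i++) {
        t->slots[i].tier = TIER_ZSWAP_VIRTUAL;
        t->slots[i].offset = i;
    }
}

/* Resolve a stable id to the page's current location. */
static struct swap_slot indirection_lookup(const struct swap_indirection *t,
                                           size_t id)
{
    assert(id < NR_VIRT_SLOTS);
    return t->slots[id];
}

/*
 * Move a page from the top tier to the backing device.
 * Returns 0 on success, -1 if the page is not in the top tier
 * (the only source tier this sketch supports).
 */
static int indirection_writeback(struct swap_indirection *t, size_t id,
                                 uint64_t backing_offset)
{
    assert(id < NR_VIRT_SLOTS);
    if (t->slots[id].tier != TIER_ZSWAP_VIRTUAL)
        return -1;

    t->slots[id].tier = TIER_BACKING;
    t->slots[id].offset = backing_offset;
    return 0;
}
```

The point of the table is that whoever holds the id (think of the swap
entry stored in a page table) does not need to be updated when the page
is written back; only the table entry changes.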

The only user-visible effect would be that if the user has zswap
enabled and did not configure a swapfile, zswap would start
compressing pages regardless, but that's what we're hoping for anyway
-- I wouldn't think this is a breaking change.

This also wouldn't be my first preference, but it seems like a smaller
step from what we have today. As long as we don't have ABI
dependencies we can always come back and change it later I suppose.