Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

Chris Li <chrisl@xxxxxxxxxx> · Tue, 5 Mar 2024 11:20:18 -0800

On Tue, Mar 5, 2024 at 2:55 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Tue, Mar 5, 2024 at 4:52 PM Chengming Zhou <chengming.zhou@xxxxxxxxx> wrote:
> >
> > Looks sensible. Now the zswap middle layer is transparent to frontend users,
> > which just allocate swap entry and swap out, don't care about whether it's
> > swapped out to the zswap or swap file.
> >
> > By decoupling, the frontend users need to know it want to allocate zswap entry
> > instead of a swap entry, right? Which becomes not transparent to users.
>
> Hmm for now, I was just thinking that it should always try zswap
> first, and only fall back to swap if it fails to store to zswap, to
> maintain the overall LRU ordering (best effort).
>
> The minimal viable implementation I'm thinking right now for this is
> basically the "ghost swapfile" approach - i.e represent zswap as a
> swapfile.

Google has been using the ghost swap file in production for many years.
If it helps, I can rebase the ghost swap file patches onto mm-unstable
and send them out for RFC discussion. I don't expect them to be merged
as is; they would just be a starting point for anyone interested in
the ghost swap file.

I think zswap with a ghost swap file will make zswap behave more like
other swap back ends. If you use the ghost swap file, migrating from
zswap to another swap device is very similar to migrating from SSD to
hard drive, for example.

> Writeback becomes quite hairy though, because there might be two
> "swap" entries of the same object (the zswap swap entry and the newly
> reserved swap entry) lying around near the end of the writeback step,
> so gotta be careful with synchronization (read: juggling the swap
> cache) to make sure concurrent swap-ins get something that makes
> sense.

Dealing with two swap device entries while writing back from one to
another is unavoidable. I consider it a necessary evil.
If we can have the swap offset lookup resolve to different swap entry
types, one idea is to introduce a migration type of swap entry that
stores both the source and the destination swap entry. Then you just
read in the source swap entry data (compressed or not) and write it
to the destination entry. Every swap-in of the source swap entry will
notice that it has the migration swap entry type and will ask the
destination swap device to perform the IO. The same folio will exist
in both the source and destination swap caches.

The limitation of this approach is that the source swap entry stays
occupied until its usage count drops to zero (i.e. every user has
swapped in the entry). Until then it can't be reused for other data.
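
To make the migration entry idea and its limitation a bit more concrete,
here is a rough userspace toy model, not kernel code. Every name in it
(toy_entry, toy_slot, toy_migrate, toy_swap_in, ...) is made up for
illustration, and the way the destination slot is released at the end
is an assumption of the sketch, not a settled design.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only: none of these types or functions exist in the kernel. */
enum slot_type { SLOT_FREE, SLOT_DATA, SLOT_MIGRATION };

#define NR_DEVICES 2
#define NR_SLOTS   8
#define DATA_SIZE  16

struct toy_entry {
	int dev;	/* which swap device, e.g. 0 = zswap, 1 = SSD */
	int offset;	/* slot offset within that device */
};

struct toy_slot {
	enum slot_type type;
	int usage;		/* users that still reference this slot */
	struct toy_entry dst;	/* valid only for SLOT_MIGRATION */
	char data[DATA_SIZE];	/* valid only for SLOT_DATA */
};

static struct toy_slot devices[NR_DEVICES][NR_SLOTS];

static struct toy_slot *slot_of(struct toy_entry e)
{
	return &devices[e.dev][e.offset];
}

/* Swap out: store data in a slot that 'usage' users will reference. */
static void toy_store(struct toy_entry e, const char *data, int usage)
{
	struct toy_slot *s = slot_of(e);

	memset(s->data, 0, DATA_SIZE);
	strncpy(s->data, data, DATA_SIZE - 1);
	s->type = SLOT_DATA;
	s->usage = usage;
}

/* Write back src to dst, then turn src into a migration entry. */
static void toy_migrate(struct toy_entry src, struct toy_entry dst)
{
	struct toy_slot *s = slot_of(src);
	struct toy_slot *d = slot_of(dst);

	memcpy(d->data, s->data, DATA_SIZE);
	d->type = SLOT_DATA;
	d->usage = 1;		/* assumption: the migration entry holds one ref */

	s->type = SLOT_MIGRATION;
	s->dst = dst;		/* src keeps its usage count and stays occupied */
}

/* One user swaps in the entry and drops its reference. */
static bool toy_swap_in(struct toy_entry e, char *out)
{
	struct toy_slot *s = slot_of(e);
	struct toy_slot *from;

	if (s->type == SLOT_FREE)
		return false;

	/* A migration entry redirects the IO to the destination device. */
	from = (s->type == SLOT_MIGRATION) ? slot_of(s->dst) : s;
	memcpy(out, from->data, DATA_SIZE);

	if (--s->usage == 0) {
		if (s->type == SLOT_MIGRATION) {
			/* assumption: drop the ref held on the destination */
			from->usage = 0;
			from->type = SLOT_FREE;
		}
		s->type = SLOT_FREE;	/* only now can the source be reused */
	}
	return true;
}
```

With two users of the source entry, the first toy_swap_in() after
toy_migrate() reads the data from the destination slot but leaves the
source slot as a migration entry. Only the second swap-in drops the
usage count to zero and frees the source slot for other data.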

Chris