Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Sun, 19 Feb 2023 01:34:26 -0800

On Sat, Feb 18, 2023 at 8:31 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > Hello everyone,
> >
> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > 2023 about swap & zswap (hope I am not too late).
>
> Submissions are due March 1st, I believe, so not too late.
>
> > ==================== Bottom Line ====================
> > It would be nice to discuss the potential here and the tradeoffs. I
> > know that other folks using zswap (or interested in using it) may find
> > this very useful. I am sure I am missing some context on why things
> > are the way they are, and perhaps some obvious holes in my story.
> > Looking forward to discussing this with anyone interested :)
> >
> > I think Johannes may be interested in attending this discussion, since
> > a lot of ideas here are inspired by discussions I had with him :)
>
> I think an overhaul of the swap code is long overdue.  I appreciate
> you're very much focused on zswap, but there are many other problems.

Fully agree. I spent more time than I care to admit just figuring out
the difference between all the functions that have "swap" and "free"
in their names :/

I cannot claim that I am attempting a full overhaul; like you said, I
am focused on zswap. But we can discuss the direction that swap should
head in, and where zswap would fit in the picture. We can at least
make sure that this zswap work is aligned with any future plans for
swap, so that we don't step on each other's toes.

> For example, swap does not work on zoned devices.  Swap readahead is
> generally physical (ie optimised for spinning discs) rather than logical
> (more appropriate for SSDs).  Swap's management of free space is crude

We have swap_vma_readahead(), which should be on by default for anon
memory on non-rotating devices, but it's only for anon. shmem only
uses swap_cluster_readahead(), and I am not sure that makes sense in
all cases, especially for zswap.
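
To make that split concrete, here is a rough userspace model of which
readahead path gets picked. This is only a sketch of the behavior
described above, not kernel code: the enum, the struct, its fields, and
pick_readahead() are all made up for illustration, and the exact
conditions are an assumption based on the description (anon memory,
vma readahead enabled, no rotating swap devices).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model for illustration only; none of these names are kernel APIs. */
enum ra_policy {
	RA_CLUSTER,	/* stands in for swap_cluster_readahead() */
	RA_VMA,		/* stands in for swap_vma_readahead() */
};

struct swapin_ctx {
	bool is_anon;		/* false models a shmem/tmpfs swapin */
	bool vma_ra_enabled;	/* assumed toggle for vma-based readahead */
	int nr_rotating_swap;	/* assumed count of rotating swap devices */
};

/* Pick the readahead flavor the way the paragraph above describes it. */
enum ra_policy pick_readahead(const struct swapin_ctx *ctx)
{
	assert(ctx->nr_rotating_swap >= 0);

	/* shmem never takes the vma-based path in this model. */
	if (!ctx->is_anon)
		return RA_CLUSTER;

	/* anon: vma-based only if enabled and every swap device is non-rotating. */
	if (ctx->vma_ra_enabled && ctx->nr_rotating_swap == 0)
		return RA_VMA;

	return RA_CLUSTER;
}
```

In this model a shmem swapin always lands on the cluster path, whatever
the backing device is, which is the part I am unsure about for zswap.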

> compared to real filesystems.  The way that swap bypasses the filesystem
> when writing to swap files is awful.  I haven't even started to look at
> what changes need to be made to swap in order to swap out arbitrary-order
> folios (instead of PMD-sized + PTE-sized).

I don't know a lot about file systems so I can't chip in here.

>
> I'm probably not a great person to participate in the design of a
> replacement system.  I don't know nearly enough about anonymous memory.

Any input would be helpful, I am sure you know more than I do :)

> I'd be sitting in the back shouting unhelpful things like, "Can't you
> see an anon_vma is the exact same thing as an inode?"  and "Why don't
> we steal the block allocation functions from XFS?"  and "Why do tmpfs
> pages have to move to the swap cache; can't we just leave them in the
> page cache and pass them to the swap code directly?"

For that last one, the proposed design makes the swap cache much less
similar to the page cache, so at least we can stop worrying about
whether we really need to use the swap cache for tmpfs ;)

>
> Maybe Neil Brown or Huang Ying would be good participants, although
> I don't recall seeing either of them at an LSFMM recently.

Looking forward to talking about this with everyone who's interested :)