On Tue, Feb 28, 2023 at 3:29 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> Hi Yosry,
>
> On Tue, Feb 28, 2023 at 12:12:05AM -0800, Yosry Ahmed wrote:
> > On Mon, Feb 27, 2023 at 8:54 PM Sergey Senozhatsky
> > <senozhatsky@xxxxxxxxxxxx> wrote:
> > >
> > > On (23/02/18 14:38), Yosry Ahmed wrote:
> > > [..]
> > > > ==================== Idea ====================
> > > > Introduce a data structure, which I currently call a swap_desc, as
> > > > an abstraction layer between the swapping implementation and the
> > > > rest of MM code. Page tables & page caches would store a swap id
> > > > (encoded as a swp_entry_t) instead of directly storing the swap
> > > > entry associated with the swapfile. This swap id maps to a struct
> > > > swap_desc, which acts as our abstraction layer. All MM code not
> > > > concerned with swapping details would operate in terms of swap
> > > > descs. The swap_desc can point to either a normal swap entry
> > > > (associated with a swapfile) or a zswap entry. It can also include
> > > > all non-backend-specific operations, such as the swapcache (which
> > > > would be a simple pointer in swap_desc), swap counting, etc. It
> > > > creates a clear, nice abstraction layer between MM code and the
> > > > actual swapping implementation.
> > > >
> > > > ==================== Benefits ====================
> > > > This work enables using zswap without a backing swapfile and
> > > > increases the swap capacity when zswap is used with a swapfile. It
> > > > also creates a separation that allows us to skip code paths that
> > > > don't make sense in the zswap path (e.g. readahead). We get to drop
> > > > zswap's rbtree, which might result in better performance (fewer
> > > > lookups, less lock contention).
> > > >
> > > > The abstraction layer also opens the door for multiple cleanups
> > > > (e.g. removing swapper address spaces, removing swap count
> > > > continuation code, etc.). Another nice cleanup that this work
> > > > enables would be separating the overloaded swp_entry_t into two
> > > > distinct types: one for things that are stored in page tables /
> > > > caches, and one for actual swap entries. In the future, we can
> > > > potentially further optimize how we use the bits in the page tables
> > > > instead of sticking everything into the current type/offset format.
> > > >
> > > > Another potential win here can be swapoff, which can be made more
> > > > practical by directly scanning all swap_desc's instead of going
> > > > through page tables and shmem page caches.
> > > >
> > > > Overall zswap becomes more accessible and available to a wider
> > > > range of use cases.
> > >
> > > I assume this also brings us closer to a proper writeback LRU handling?
> >
> > I assume by proper LRU handling you mean:
> > - Swap writeback LRU that lives outside of the zpool backends (i.e. in
> >   zswap itself or even outside zswap).
>
> Even outside zswap to support any combination on any heterogeneous
> multiple swap device configuration.

Agreed, this is the end goal for the writeback LRU.

>
> The indirection layer would be essential to support it, but it would
> also be great if we don't waste any memory for users who don't want
> the feature.

I can't currently think of a way to eliminate the overhead for users who
only use swapfiles, because a lot of the core implementation changes;
the alternative would be maintaining considerably more code with a lot
of the same functionality implemented twice in different ways.
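To make the overhead concern more concrete, here is a very rough sketch
of the kind of per-swapped-page structure I have in mind. This is purely
illustrative; the field names and layout are assumptions, nothing here
is settled:

	/* Illustrative sketch only: names and layout are not settled. */
	struct swap_desc {
		union {
			swp_entry_t slot;		/* slot in a backing swapfile */
			struct zswap_entry *zswap;	/* compressed copy in zswap */
		};
		struct folio *swapcache;	/* folio in the swapcache, if any */
		unsigned int swap_count;	/* would replace swap_map + continuations */
		unsigned long flags;		/* e.g. whether the entry is in zswap */
	};

Even in this minimal form that is on the order of 32 bytes per
swapped-out page on 64-bit, paid whether or not zswap is in use, which
is exactly the overhead being asked about.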
Perhaps this will change as the implementation takes shape; maybe things
are better (or worse) than I think they are. I am actively working on a
proof-of-concept right now. Maybe a discussion at LSF/MM/BPF will help
come up with optimizations as well :)

>
> Just FYI, there was similar discussion long time ago about the
> indirection layer.
> https://lore.kernel.org/linux-mm/4DA25039.3020700@xxxxxxxxxx/

Yeah, Hugh shared this one with me earlier, but there are a few things
that I don't understand how they would work, at least in today's world.

Firstly, the proposal suggests that we store a radix tree index in the
page tables, and in the radix tree store the swap entry AND the swap
count. I am not really sure how they would fit in 8 bytes, especially if
we need continuation and 1 byte is not enough for the swap count.
Continuation logic now depends on linking vmalloc'd pages using the lru
field in struct page/folio. Perhaps we can figure out a split that gives
enough space for the swap count without continuation while also not
limiting swapfile sizes too much.

Secondly, IIUC in that proposal once we swap a page in, we free the swap
entry and add the swapcache page to the radix tree instead. In that
case, where does the swap count go? IIUC we still need to maintain it to
be able to tell when all processes mapping the page have faulted it back
in; otherwise the radix tree entry is maintained indefinitely. We can
maybe stash the swap count somewhere else in this case, and bring it
back to the radix tree if we swap the page out again. I am not really
sure where: we could have a separate radix tree for swap counts only
while the page is in the swapcache, or we could always keep the count in
a separate radix tree so that the swap entry fits comfortably in the
first radix tree.

To be able to accommodate zswap in this design, I think we always need a
separate radix tree for swap counts. In that case, one radix tree
contains swap_entry/zswap_entry/swapcache, and the other radix tree
contains the swap count. I think this may work, but I am not sure if the
overhead of always doing a lookup to read the swap count is okay. I am
also sure there would be some fun synchronization problems between the
two trees (but we already need to synchronize today between the
swapcache and swap counts?).

It sounds like it is possible to make it work. I will spend some time
thinking about it. Having 2 radix trees also solves the 32-bit systems
problem, but I am not sure if it's a generally better design. Radix
trees also take up some extra space beyond the entry size itself, so I
am not sure how much memory we would end up actually saving.

Johannes, I am curious if you have any thoughts about this alternative
design?
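For reference, here is a very rough sketch of the two-tree variant I am
picturing, assuming the kernel's XArray stands in for the radix tree.
All names (swap_entry_tree, swap_count_tree, the helpers) are made up
purely for illustration:

	#include <linux/xarray.h>

	/* Hypothetical sketch only: names and layout are illustrative. */
	static DEFINE_XARRAY(swap_entry_tree);	/* id -> swap/zswap entry or swapcache folio */
	static DEFINE_XARRAY(swap_count_tree);	/* id -> swap count, stored as a value entry */

	/* Every swap count read becomes an extra lookup in the second tree. */
	static unsigned int swap_id_count(unsigned long id)
	{
		void *entry = xa_load(&swap_count_tree, id);

		return entry ? xa_to_value(entry) : 0;
	}

	static int swap_id_set_count(unsigned long id, unsigned int count)
	{
		return xa_err(xa_store(&swap_count_tree, id, xa_mk_value(count),
				       GFP_KERNEL));
	}

The extra xa_load() on every swap count read is the overhead I am unsure
about, and keeping the two trees consistent is where I would expect the
fun synchronization problems to show up.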