On Wed, Mar 1, 2023 at 4:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Tue, Feb 28, 2023 at 3:29 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >
> > Hi Yosry,
> >
> > On Tue, Feb 28, 2023 at 12:12:05AM -0800, Yosry Ahmed wrote:
> > > On Mon, Feb 27, 2023 at 8:54 PM Sergey Senozhatsky <senozhatsky@xxxxxxxxxxxx> wrote:
> > > >
> > > > On (23/02/18 14:38), Yosry Ahmed wrote:
> > > > [..]
> > > > > ==================== Idea ====================
> > > > > Introduce a data structure, which I currently call a swap_desc, as an abstraction layer between the swapping implementation and the rest of the MM code. Page tables & page caches would store a swap id (encoded as a swp_entry_t) instead of directly storing the swap entry associated with the swapfile. This swap id maps to a struct swap_desc, which acts as our abstraction layer. All MM code not concerned with swapping details would operate in terms of swap descs. The swap_desc can point to either a normal swap entry (associated with a swapfile) or a zswap entry. It can also include all non-backend-specific operations, such as the swapcache (which would be a simple pointer in swap_desc), swap counting, etc. It creates a clear, nice abstraction layer between MM code and the actual swapping implementation.
> > > > >
> > > > > ==================== Benefits ====================
> > > > > This work enables using zswap without a backing swapfile and increases the swap capacity when zswap is used with a swapfile. It also creates a separation that allows us to skip code paths that don't make sense in the zswap path (e.g. readahead). We get to drop zswap's rbtree, which might result in better performance (fewer lookups, less lock contention).
> > > > >
> > > > > The abstraction layer also opens the door for multiple cleanups (e.g. removing swapper address spaces, removing swap count continuation code, etc.). Another nice cleanup that this work enables would be separating the overloaded swp_entry_t into two distinct types: one for things that are stored in page tables / caches, and one for actual swap entries. In the future, we can potentially further optimize how we use the bits in the page tables instead of sticking everything into the current type/offset format.
> > > > >
> > > > > Another potential win here is swapoff, which can be made more practical by directly scanning all swap_descs instead of going through page tables and shmem page caches.
> > > > >
> > > > > Overall, zswap becomes more accessible and available to a wider range of use cases.
> > > >
> > > > I assume this also brings us closer to proper writeback LRU handling?
> > >
> > > I assume by proper LRU handling you mean:
> > > - Swap writeback LRU that lives outside of the zpool backends (i.e. in zswap itself or even outside zswap).
> >
> > Even outside zswap, to support any combination on any heterogeneous multiple swap device configuration.
>
> Agreed, this is the end goal for the writeback LRU.
>
> > The indirection layer would be essential to support it, but it would also be great if we don't waste any memory for users who don't want the feature.
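Side note, for anyone who did not see the original posting: here is a rough sketch of what I am picturing for swap_desc. Field names and layout are purely illustrative and will almost certainly change in the proof-of-concept:

	/*
	 * Rough sketch only: field names and layout are illustrative,
	 * not taken from the actual proof-of-concept.
	 */
	struct swap_desc {
		union {
			swp_entry_t swap_entry;		 /* slot in a real swapfile */
			struct zswap_entry *zswap_entry; /* compressed copy, zswap only */
		};
		struct folio *swapcache;	/* swapcache becomes a simple pointer */
		unsigned int swap_count;	/* backend-agnostic swap counting */
		unsigned long flags;		/* e.g. which union member is valid */
	};

The point is just that everything outside of the swapping code only ever sees the swap id and the swap_desc; whether the data actually sits in a swapfile or in zswap stays hidden behind it.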
>
> I can't currently think of a way to eliminate the overhead for people only using swapfiles, as a lot of the core implementation changes, unless we want to maintain considerably more code with a lot of repeated functionality implemented differently. Perhaps this will change as I implement it; maybe things are better (or worse) than I think they are. I am actively working on a proof-of-concept right now. Maybe a discussion at LSF/MM/BPF will help come up with optimizations as well :)
>
> > Just FYI, there was a similar discussion about the indirection layer a long time ago:
> > https://lore.kernel.org/linux-mm/4DA25039.3020700@xxxxxxxxxx/
>
> Yeah, Hugh shared this one with me earlier, but there are a few things that I don't understand how they would work, at least in today's world.
>
> Firstly, the proposal suggests that we store a radix tree index in the page tables, and in the radix tree store the swap entry AND the swap count. I am not really sure how they would fit in 8 bytes, especially if we need continuation and 1 byte is not enough for the swap count. Continuation logic now depends on linking vmalloc'd pages using the lru field in struct page/folio. Perhaps we can figure out a split that gives enough space for the swap count without continuation while also not limiting swapfile sizes too much.
>
> Secondly, IIUC in that proposal once we swap a page in, we free the swap entry and add the swapcache page to the radix tree instead. In that case, where does the swap count go? IIUC we still need to maintain it to be able to tell when all processes mapping the page have faulted it back; otherwise the radix tree entry is maintained indefinitely. We can maybe stash the swap count somewhere else in this case and bring it back to the radix tree if we swap the page out again. I am not really sure where: we could have a separate radix tree for swap counts only while the page is in the swapcache, or we could always keep it in a separate radix tree so that the swap entry fits comfortably in the first radix tree.
>
> To be able to accommodate zswap in this design, I think we always need a separate radix tree for swap counts. In that case, one radix tree contains swap_entry/zswap_entry/swapcache, and the other radix tree contains the swap count. I think this may work, but I am not sure if the overhead of always doing a lookup to read the swap count is okay. I am also sure there would be some fun synchronization problems between both trees (but we already need to synchronize today between the swapcache and swap counts?).
>
> It sounds like it is possible to make it work. I will spend some time thinking about it. Having two radix trees also solves the 32-bit systems problem, but I am not sure if it's a generally better design. Radix trees also take up some extra space beyond the entry size itself, so I am not sure how much memory we would end up actually saving.
>
> Johannes, I am curious if you have any thoughts about this alternative design?

I completely forgot about shadow entries here. I don't think any of this works with shadow entries, as we still need to maintain them while the page is swapped out.
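To make the two-tree idea above a bit more concrete, here is a purely illustrative sketch; the names are made up, it uses xarrays (the modern radix tree interface), and it ignores locking, 32-bit, and exactly the shadow entry problem I just mentioned:

	#include <linux/xarray.h>

	/* Indexed by the id stored in page tables: a swapfile entry, a zswap
	 * entry, or the swapcache folio, depending on where the page lives. */
	static DEFINE_XARRAY(swap_backing);

	/* Indexed by the same id: the swap count, kept separately so it
	 * survives the page moving in and out of the swapcache. */
	static DEFINE_XARRAY(swap_counts);

	/* Counts would be stored as xarray value entries; returns 0 if none. */
	static unsigned long swap_count_of(unsigned long id)
	{
		return xa_to_value(xa_load(&swap_counts, id));
	}

Every read of the count is then a second tree walk, which is the lookup overhead concern I mentioned above.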