Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap

Minchan Kim <minchan@xxxxxxxxxx> · Thu, 2 Mar 2023 16:33:28 -0800

On Wed, Mar 01, 2023 at 04:30:22PM -0800, Yosry Ahmed wrote:
> On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > Hi Yosry,
> >
> > On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > > Hello everyone,
> > >
> > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > 2023 about swap & zswap (hope I am not too late).
> >
> > I am very interested in participating in this discussion as well.
> 
> That's great to hear!
> 
> >
> > > ==================== Objective ====================
> > > Enabling the use of zswap without a backing swapfile, which makes
> > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > used with a swapfile, the pages in zswap do not use up space in the
> > > swapfile, so the overall swapping capacity increases.
> >
> > Agree.
> >
> > >
> > > ==================== Idea ====================
> > > Introduce a data structure, which I currently call a swap_desc, as an
> > > abstraction layer between swapping implementation and the rest of MM
> > > code. Page tables & page caches would store a swap id (encoded as a
> > > swp_entry_t) instead of directly storing the swap entry associated
> > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> >
> > Can you provide a bit more detail? I am curious how this swap id
> > maps into the swap_desc? Is the swp_entry_t cast into "struct
> > swap_desc*" or going through some lookup table/tree?
> 
> swap id would be an index in a radix tree (aka xarray), which contains
> a pointer to the swap_desc struct. This lookup should be free with
> this design as we also use swap_desc to directly store the swap cache
> pointer, so this lookup essentially replaces the swap cache lookup.
> 
> >
> > > as our abstraction layer. All MM code not concerned with swapping
> > > details would operate in terms of swap descs. The swap_desc can point
> > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > entry. It can also include all non-backend specific operations, such
> > > as the swapcache (which would be a simple pointer in swap_desc), swap
> >
> > Does the zswap entry still use the swap slot cache and swap_info_struct?
> 
> In this design no, it shouldn't.
> 
> >
> > > This work enables using zswap without a backing swapfile and increases
> > > the swap capacity when zswap is used with a swapfile. It also creates
> > > a separation that allows us to skip code paths that don't make sense
> > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > which might result in better performance (fewer lookups, less lock
> > > contention).
> > >
> > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > removing swapper address spaces, removing swap count continuation
> > > code, etc). Another nice cleanup that this work enables would be
> > > separating the overloaded swp_entry_t into two distinct types: one for
> > > things that are stored in page tables / caches, and one for actual swap
> > > entries. In the future, we can potentially further optimize how we use
> > > the bits in the page tables instead of sticking everything into the
> > > current type/offset format.
> >
> > Looking forward to seeing more details in the upcoming discussion.
> > >
> > > ==================== Cost ====================
> > > The obvious downside of this is added memory overhead, specifically
> > > for users that use swapfiles without zswap. Instead of paying one byte
> > > (swap_map) for every potential page in the swapfile (+ swap count
> > > continuation), we pay the size of the swap_desc for every page that is
> > > actually in the swapfile, which I am estimating can be roughly around
> > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > scales with pages actually swapped out. For zswap users, it should be
> >
> > Is there a way to avoid turning 1 byte into 24 bytes per swapped
> > page? For users that use swap but no zswap, this is pure overhead.
> 
> That's what I could think of at this point. My idea was something like this:
> 
> struct swap_desc {
>     union { /* Use one bit to distinguish them */
>         swp_entry_t swap_entry;
>         struct zswap_entry *zswap_entry;
>     };
>     struct folio *swapcache;
>     atomic_t swap_count;
>     u32 id;
> };
> 
> Having the id in the swap_desc is convenient as we can directly map
> the swap_desc to a swp_entry_t to place in the page tables, but I
> don't think it's necessary. Without it, the struct size is 20 bytes,
> so I think the extra 4 bytes are okay to use anyway if the slab
> allocator only allocates multiples of 8 bytes.
> 
> The idea here is to unify the swapcache and swap_count implementation
> between different swap backends (swapfiles, zswap, etc), which would
> create a better abstraction and reduce reinventing the wheel.
> 
> We can reduce to only 8 bytes and only store the swap/zswap entry, but
> we still need the swap cache anyway so might as well just store the
> pointer in the struct and have a unified lookup-free swapcache, so
> really 16 bytes is the minimum.
> 
> If we stop at 16 bytes, then we need to handle swap count separately
> in swapfiles and zswap. This is not the end of the world, but are the
> 8 bytes worth this?
> 
> Keep in mind that the current overhead is 1 byte O(max swap pages) not
> O(swapped). Also, 1 byte is assuming we do not use the swap

Just to share info:

Android usually keeps its swap space almost fully used most of the time,
because it compacts background apps, so O(swapped) ~= O(max swap pages).