[LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Sat, 18 Feb 2023 14:38:40 -0800

Hello everyone,

I would like to propose a topic for the upcoming LSF/MM/BPF in May
2023 about swap & zswap (hope I am not too late).

==================== Intro ====================
Currently, zswap depends on swapfiles in an unnecessary way. To use
zswap, you need a swapfile configured (even if its space will never be
used), and zswap's capacity is limited by the swapfile's size. When
pages reside in zswap, the corresponding swap entries in the swapfile
cannot be used, and are essentially wasted. We also go through unnecessary
code paths when using zswap, such as finding and allocating a swap
entry on the swapout path, or readahead in the swapin path. I am
proposing a swapping abstraction layer that would allow us to remove
zswap's dependency on swapfiles. This can be done by introducing a
data structure between the actual swapping implementation (swapfiles,
zswap) and the rest of the MM code.

==================== Objective ====================
Enable the use of zswap without a backing swapfile, which makes zswap
useful for a wider variety of use cases. Also, when zswap is used with
a swapfile, pages in zswap would no longer use up space in the
swapfile, so the overall swapping capacity increases.

==================== Idea ====================
Introduce a data structure, which I currently call a swap_desc, as an
abstraction layer between the swapping implementation and the rest of
the MM code. Page tables & page caches would store a swap id (encoded as a
swp_entry_t) instead of directly storing the swap entry associated
with the swapfile. This swap id maps to a struct swap_desc, which acts
as our abstraction layer. All MM code not concerned with swapping
details would operate in terms of swap descs. The swap_desc can point
to either a normal swap entry (associated with a swapfile) or a zswap
entry. It can also hold all backend-agnostic state, such as the
swapcache (which would be a simple pointer in swap_desc), swap counts,
etc. This creates a clean abstraction layer between MM code and the
actual swapping implementation.
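
To make the idea more concrete, here is a rough sketch of what such a
descriptor could look like. Everything in it is hypothetical: the field
names, the helpers, the swap_slot_t stand-in and the layout are
illustration only, not existing kernel API and not a settled design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: which backend currently holds the swapped-out data. */
enum swap_backend {
    SWAP_BACKEND_SWAPFILE,
    SWAP_BACKEND_ZSWAP,
};

/* Hypothetical stand-in for a swapfile slot (type + offset packed together). */
typedef struct { uint64_t val; } swap_slot_t;

struct zswap_entry;   /* opaque here: the compressed copy */
struct folio;         /* opaque here: the page in the swap cache */

/* Hypothetical descriptor that a swap id would map to. */
struct swap_desc {
    union {
        swap_slot_t slot;            /* valid for SWAP_BACKEND_SWAPFILE */
        struct zswap_entry *zswap;   /* valid for SWAP_BACKEND_ZSWAP */
    } backend;
    struct folio *swapcache;         /* NULL when not in the swap cache */
    uint32_t swap_count;             /* backend-agnostic reference count */
    uint8_t backend_type;            /* enum swap_backend */
};

static inline bool swap_desc_in_zswap(const struct swap_desc *desc)
{
    return desc->backend_type == SWAP_BACKEND_ZSWAP;
}

/* Point the descriptor at a swapfile slot. */
static inline void swap_desc_set_slot(struct swap_desc *desc, swap_slot_t slot)
{
    desc->backend.slot = slot;
    desc->backend_type = SWAP_BACKEND_SWAPFILE;
}

/* Point the descriptor at a zswap entry. */
static inline void swap_desc_set_zswap(struct swap_desc *desc,
                                       struct zswap_entry *entry)
{
    desc->backend.zswap = entry;
    desc->backend_type = SWAP_BACKEND_ZSWAP;
}
```

The point of the sketch is that the swap id stored in page tables never
needs to encode where the data lives; only the descriptor knows, so MM
code outside the swap layer does not have to care about the backend.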

==================== Benefits ====================
This work enables using zswap without a backing swapfile and increases
the swap capacity when zswap is used with a swapfile. It also creates
a separation that allows us to skip code paths that don't make sense
in the zswap path (e.g. readahead). We also get to drop zswap's rbtree,
which might result in better performance (fewer lookups, less lock
contention).

The abstraction layer also opens the door for multiple cleanups (e.g.
removing swapper address spaces, removing swap count continuation
code, etc). Another nice cleanup that this work enables would be
separating the overloaded swp_entry_t into two distinct types: one for
things that are stored in page tables / caches, and one for actual swap
entries. In the future, we can potentially further optimize how we use
the bits in the page tables instead of sticking everything into the
current type/offset format.

Another potential win here is swapoff, which could be made more
practical by directly scanning all swap_descs instead of going through
page tables and shmem page caches.

Overall zswap becomes more accessible and available to a wider range
of use cases.

==================== Cost ====================
The obvious downside of this is added memory overhead, specifically
for users that use swapfiles without zswap. Instead of paying one byte
(swap_map) for every potential page in the swapfile (+ swap count
continuation), we pay the size of the swap_desc for every page that is
actually in the swapfile, which I am estimating at roughly 24 bytes,
or about 0.6% of swapped out memory with 4K pages. The overhead only
scales with pages actually swapped out. For zswap users, it should be
a win (or at least even) because we get to drop a lot of fields from
struct zswap_entry (e.g. rbtree, index, etc).
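
The arithmetic behind that estimate can be spelled out as follows. This
is only a back-of-the-envelope sketch: it assumes 4 KiB pages, ignores
swap count continuation, and all helper names are made up for the
example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define DEMO_PAGE_SIZE 4096u

/* Current scheme: one swap_map byte per slot in the swapfile, used or not
 * (swap count continuation is ignored here). */
static inline uint64_t swap_map_bytes(uint64_t swapfile_slots)
{
    return swapfile_slots;
}

/* Proposed scheme: one descriptor per page that is actually swapped out. */
static inline uint64_t swap_desc_bytes(uint64_t swapped_pages,
                                       uint64_t desc_size)
{
    return swapped_pages * desc_size;
}

/* Per-page overhead in basis points (1/100 of a percent), rounded down. */
static inline unsigned int overhead_bp(unsigned int desc_size)
{
    return desc_size * 10000u / DEMO_PAGE_SIZE;
}
```

For example, a 4 GiB swapfile has 2^20 slots, so swap_map costs 1 MiB
whether or not anything is swapped out, while 24-byte descriptors cost
up to 24 MiB, and only once every slot is actually in use.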

Another potential concern is readahead. With this design, we have no
way to get a swap_desc given a swap entry (type & offset). We would
need to maintain a reverse mapping, adding a little bit more overhead,
or search all swapped out pages instead :). A reverse mapping might
bump the per-swapped-page overhead to ~32 bytes (~0.8% of swapped out
memory).
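
As a rough illustration of what a reverse mapping would have to provide
for readahead, here is a toy version that maps a slot offset back to its
descriptor. It is purely a sketch: the names are hypothetical, and the
dense array is a stand-in that costs one pointer per slot. A real
implementation would presumably need a sparse structure to keep the cost
proportional to the pages actually swapped out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SLOTS 16   /* hypothetical tiny swapfile */

/* Placeholder descriptor; only its identity matters in this sketch. */
struct swap_desc { int placeholder; };

/* Hypothetical reverse map for one swapfile: slot offset -> descriptor. */
struct swap_rmap {
    struct swap_desc *desc_of[DEMO_SLOTS];
};

static inline void swap_rmap_set(struct swap_rmap *rmap, size_t offset,
                                 struct swap_desc *desc)
{
    assert(offset < DEMO_SLOTS);
    rmap->desc_of[offset] = desc;
}

/* Returns NULL for unused or out-of-range slots. */
static inline struct swap_desc *swap_rmap_lookup(const struct swap_rmap *rmap,
                                                 size_t offset)
{
    return offset < DEMO_SLOTS ? rmap->desc_of[offset] : NULL;
}

/* Readahead-style walk: collect descriptors for slots [start, start + nr). */
static inline size_t swap_rmap_window(const struct swap_rmap *rmap,
                                      size_t start, size_t nr,
                                      struct swap_desc **out)
{
    size_t found = 0;

    for (size_t i = 0; i < nr; i++) {
        struct swap_desc *desc = swap_rmap_lookup(rmap, start + i);

        if (desc)
            out[found++] = desc;
    }
    return found;
}
```

Without something like swap_rmap_lookup(), readahead has the neighbouring
slot offsets on disk but no way to tell which swap ids (and therefore
which pages) they belong to.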

==================== Bottom Line ====================
It would be nice to discuss the potential here and the tradeoffs. I
know that other folks using zswap (or interested in using it) may find
this very useful. I am sure I am missing some context on why things
are the way they are, and perhaps some obvious holes in my story.
Looking forward to discussing this with anyone interested :)

I think Johannes may be interested in attending this discussion, since
a lot of ideas here are inspired by discussions I had with him :)