On Thu, Jan 16, 2025 at 6:47 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Fri, Jan 17, 2025 at 1:48 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > >
> > > My apologies if I missed any interested party in the cc list -
> > > hopefully the mailing list cc's suffice :)
> > >
> > > I'd like to (re-)propose the topic of a swap abstraction layer for the
> > > conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> > > (see [1], [2], [3]).
> > >
> > > (AFAICT, the same idea has been floated by Rik van Riel since at
> > > least 2011 - see [8]).
> > >
> > > I have a working(-ish) prototype, which hopefully will be
> > > submission-ready soon. For now, I'd like to give the motivation/context
> > > for the topic, as well as some high level design:
> >
> > I would obviously be interested in attending this, albeit virtually if
> > possible. Just sharing some random thoughts below from my cold cache.
>
> Your inputs are always appreciated :)
>
> >
> > > I. Motivation
> > >
> > > Currently, when an anon page is swapped out, a slot in a backing swap
> > > device is allocated and stored in the page table entries that refer to
> > > the original page. This slot is also used as the "key" to find the
> > > swapped out content, as well as the index into swap data structures,
> > > such as the swap cache, or the swap cgroup mapping. Tying a swap entry
> > > to its backing slot in this way is performant and efficient when swap
> > > is just disk space, and swapoff is rare.
> > >
> > > However, the advent of many swap optimizations has exposed major
> > > drawbacks of this design. The first problem is that we occupy a physical
> > > slot in the swap space, even for pages that are NEVER expected to hit
> > > the disk: pages compressed and stored in the zswap pool, zero-filled
> > > pages, or pages rejected by both of these optimizations when zswap
> > > writeback is disabled. This is arguably the central shortcoming of
> > > zswap:
> > > * In deployments where no disk space can be afforded for swap (such as
> > >   mobile and embedded devices), users cannot adopt zswap, and are forced
> > >   to use zram. This is confusing for users, and creates extra burden
> > >   for developers, who have to develop and maintain similar features for
> > >   two separate swap backends (writeback, cgroup charging, THP support,
> > >   etc.). For instance, see the discussion in [4].
> > > * Resource-wise, it is hugely wasteful in terms of disk usage, and
> > >   limits the memory saving potential of these optimizations to the
> > >   static size of the swapfile, especially in high memory systems that
> > >   can have up to terabytes worth of memory. It also creates significant
> > >   challenges for users who rely on swap utilization as an early OOM
> > >   signal.
> > >
> > > Another motivation for a swap redesign is to simplify swapoff, which
> > > is complicated and expensive in the current design. Tight coupling
> > > between a swap entry and its backing storage means that swapoff
> > > requires a whole page table walk to update all the page table entries
> > > that refer to each swap entry, as well as updating all the associated
> > > swap data structures (swap cache, etc.).
> > >
> > >
> > > II. High Level Design Overview
> > >
> > > To fix the aforementioned issues, we need an abstraction that separates
> > > a swap entry from its physical backing storage. IOW, we need to
> > > “virtualize” the swap space: swap clients will work with a virtual swap
> > > slot (that is dynamically allocated on-demand), storing it in page
> > > table entries, and using it to index into various swap-related data
> > > structures.
> > >
> > > The backing storage is decoupled from this slot, and the newly
> > > introduced layer will “resolve” the ID to the actual storage, as well
> > > as cooperate with the swap cache to handle all the required
> > > synchronization. This layer also manages other metadata of the swap
> > > entry, such as its lifetime information (swap count), via a dynamically
> > > allocated per-entry swap descriptor:
> >
> > Do you plan to allocate one per-folio or per-page? I suppose it's
> > per-page based on the design, but I am wondering if you explored
> > having it per-folio. To make it work we'd need to support splitting a
> > swp_desc, and figuring out which slot or zswap_entry corresponds to a
> > certain page in a folio.
>
> Per-page, for now. Per-folio requires allocating these swp_descs on
> huge page splitting etc., which is more complex.

We'd also need to allocate them during swapin. If a folio is swapped
out as a 16K chunk with a single swp_desc, and we then try to swap in
a single 4K page in the middle, we may need to split the swp_desc into
two.

>
> And yeah, we need to chain these zswap_entry's somehow. Not impossible
> certainly, but more overhead and more complexity :)
>
> > >
> > >
> > > struct swp_desc {
> > >         swp_entry_t vswap;
> > >         union {
> > >                 swp_slot_t slot;
> > >                 struct folio *folio;
> > >                 struct zswap_entry *zswap_entry;
> > >         };
> > >         struct rcu_head rcu;
> > >
> > >         rwlock_t lock;
> > >         enum swap_type type;
> > >
> > > #ifdef CONFIG_MEMCG
> > >         atomic_t memcgid;
> > > #endif
> > >
> > >         atomic_t in_swapcache;
> > >         struct kref refcnt;
> > >         atomic_t swap_count;
> > > };
> >
> > That seems a bit large. I am assuming this is for the purpose of the
> > prototype and we can reduce its size eventually, right?
>
> Yup. I copied and pasted this from the prototype. Originally I
> squeezed all the state (in_swapcache and the swap type) into an
> integer-type "flag" field + 1 separate swap count field, and protected
> them all with a single rw lock. That gets really ugly/confusing, so
> for the sake of the prototype I just separated them all out into their
> own fields, and play with atomicity to see if it's possible to do
> things locklessly. So far so good (i.e., no crashes yet), but the final
> form is TBD :) Maybe we can discuss this in closer detail once I send
> out this prototype as an RFC?

Yeah, I just had some passing comments.

>
> (I will say, though, it looks cleaner when all these fields are
> separated. So it's going to be a tradeoff in that sense too).

It's a tradeoff, but I think we should be able to hide a lot of the
complexity behind neat helpers. It's not pretty, but I think the memory
overhead is an important factor here.

> >
> > Particularly, I remember looking into merging the swap_count and
> > refcnt, and I am not sure what in_swapcache is (is this a bit? Why
> > can't we use a bit from swap_count?).
>
> Yup. That's a single bit - it's a (partial) replacement for
> SWAP_HAS_CACHE state in the existing swap map.
>
> No particular reason why we can't squeeze it into the swap count other
> than clarity :) It's going to be a bit annoying working with swap
> count values (swap count increment is now * 2 instead of ++ etc.).

Nothing a nice helper cannot hide :)
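
To make the helper idea concrete, below is a rough sketch of what such
helpers could look like if the cache bit lived in bit 0 of swap_count.
To be clear, this is just an illustration, not code from your
prototype; the names and the exact encoding are made up:

/*
 * Illustrative sketch only; not from the prototype. The "in swap cache"
 * state lives in bit 0 of swap_count, the actual count in the remaining
 * bits, so callers never see the "* 2 instead of ++" math directly.
 */
#define VSWAP_CACHE_BIT         1
#define VSWAP_COUNT_UNIT        2

static inline int vswap_count(struct swp_desc *desc)
{
        return atomic_read(&desc->swap_count) >> 1;
}

static inline bool vswap_in_swapcache(struct swp_desc *desc)
{
        return atomic_read(&desc->swap_count) & VSWAP_CACHE_BIT;
}

static inline void vswap_count_inc(struct swp_desc *desc)
{
        atomic_add(VSWAP_COUNT_UNIT, &desc->swap_count);
}

static inline void vswap_count_dec(struct swp_desc *desc)
{
        atomic_sub(VSWAP_COUNT_UNIT, &desc->swap_count);
}

/* Returns true if we won the race to claim the swap cache bit. */
static inline bool vswap_try_set_swapcache(struct swp_desc *desc)
{
        return !(atomic_fetch_or(VSWAP_CACHE_BIT, &desc->swap_count) &
                 VSWAP_CACHE_BIT);
}

The cost is one less bit of swap count range, which is a similar
tradeoff to how SWAP_HAS_CACHE already shares the per-slot swap_map
byte with the count today.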
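
Also, stepping back to the design overview for a second, in case it
helps frame the discussion at the conference: the way I picture the
"resolution" step is roughly the sketch below. Again, purely
illustrative; the xarray, the swap type value and the helper names are
hypothetical, not taken from the prototype:

/*
 * Hypothetical sketch: map a virtual swap slot to its descriptor, and
 * from there to whatever currently backs it (physical slot, zswap
 * entry, in-memory folio, ...).
 */
static DEFINE_XARRAY(vswap_descs);      /* vswap.val -> struct swp_desc * */

static struct swp_desc *vswap_to_desc(swp_entry_t vswap)
{
        return xa_load(&vswap_descs, vswap.val);
}

/*
 * Resolve a virtual slot to the physical swap slot backing it, or a
 * zero slot if the entry currently lives in zswap, is zero-filled, etc.
 */
static swp_slot_t vswap_to_slot(swp_entry_t vswap)
{
        struct swp_desc *desc = vswap_to_desc(vswap);
        swp_slot_t slot = { 0 };

        if (!desc)
                return slot;

        read_lock(&desc->lock);
        if (desc->type == VSWAP_SWAPFILE)       /* hypothetical swap_type value */
                slot = desc->slot;
        read_unlock(&desc->lock);

        return slot;
}

Page table entries and the swap cache would only ever see the vswap
value; only this layer knows (and is allowed to change) what is behind
it, which is what should make things like swapoff much cheaper: only
the descriptor needs to change, not every PTE.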