On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> My apologies if I missed any interested party in the cc list -
> hopefully the mailing list cc's suffice :)
>
> I'd like to (re-)propose the topic of a swap abstraction layer for the
> conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> (see [1], [2], [3]).
>
> (AFAICT, the same idea has been floated by Rik van Riel since at
> least 2011 - see [8]).
>
> I have a working(-ish) prototype, which hopefully will be
> submission-ready soon. For now, I'd like to give the motivation/context
> for the topic, as well as some high level design:

I would obviously be interested in attending this, albeit virtually if
possible. Just sharing some random thoughts below from my cold cache.

>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index into swap data structures,
> such as the swap cache or the swap cgroup mapping. Tying a swap entry
> to its backing slot in this way is performant and efficient when swap
> is purely disk space and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a
> physical slot in the swap space, even for pages that are NEVER
> expected to hit the disk: pages compressed and stored in the zswap
> pool, zero-filled pages, or pages rejected by both of these
> optimizations when zswap writeback is disabled. This is arguably the
> central shortcoming of zswap:
> * In deployments where no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap and are
>   forced to use zram. This is confusing for users, and creates extra
>   burdens for developers, who have to develop and maintain similar
>   features for two separate swap backends (writeback, cgroup charging,
>   THP support, etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and it
>   limits the memory saving potential of these optimizations to the
>   static size of the swapfile, especially on high memory systems that
>   can have up to terabytes worth of memory. It also creates
>   significant challenges for users who rely on swap utilization as an
>   early OOM signal.
>
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. The tight coupling
> between a swap entry and its backing storage means that swapoff
> requires a whole page table walk to update all the page table entries
> that refer to this swap entry, as well as updating all the associated
> swap data structures (swap cache, etc.).
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that
> separates a swap entry from its physical backing storage. IOW, we need
> to "virtualize" the swap space: swap clients will work with a virtual
> swap slot (that is dynamically allocated on-demand), storing it in
> page table entries, and using it to index into various swap-related
> data structures.
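For anyone trying to picture the indirection, my mental model is
roughly the sketch below. To be clear, this is not from Nhat's
prototype -- the names (vswap_map, vswap_alloc, vswap_resolve) and the
choice of an allocating xarray are mine, purely to illustrate the idea
that the PTE holds a stable virtual slot and only the per-entry
descriptor knows where the data currently lives:

/*
 * Illustrative only; uses <linux/xarray.h> and the swp_entry_t
 * helpers from <linux/swapops.h>. Slot 0 is left unused.
 */
static DEFINE_XARRAY_ALLOC1(vswap_map);	/* virtual slot -> swp_desc */

/* Allocate a virtual slot on demand at swapout time. */
static int vswap_alloc(struct swp_desc *desc, swp_entry_t *vswap)
{
        u32 id;
        int err;

        err = xa_alloc(&vswap_map, &id, desc, xa_limit_32b, GFP_KERNEL);
        if (err)
                return err;

        /* A single "virtual" swap type; this value goes into the PTE. */
        *vswap = swp_entry(0, id);
        return 0;
}

/* Resolve a virtual slot (e.g. at swapin) to its current backing. */
static struct swp_desc *vswap_resolve(swp_entry_t vswap)
{
        return xa_load(&vswap_map, swp_offset(vswap));
}

With something like this, moving a page between backends (zswap to
disk, or faulting it back in for swapoff) only needs to update the
descriptor, not the PTEs or the swap cache indexing.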
>
> The backing storage is decoupled from this slot, and the newly
> introduced layer will "resolve" the ID to the actual storage, as well
> as cooperating with the swap cache to handle all the required
> synchronization. This layer also manages other metadata of the swap
> entry, such as its lifetime information (swap count), via a
> dynamically allocated per-entry swap descriptor:

Do you plan to allocate one per-folio or per-page? I suppose it's
per-page based on the design, but I am wondering if you explored having
it per-folio. To make it work we'd need to support splitting a
swp_desc, and figuring out which slot or zswap_entry corresponds to a
certain page in a folio.

>
> struct swp_desc {
>         swp_entry_t vswap;
>         union {
>                 swp_slot_t slot;
>                 struct folio *folio;
>                 struct zswap_entry *zswap_entry;
>         };
>         struct rcu_head rcu;
>
>         rwlock_t lock;
>         enum swap_type type;
>
> #ifdef CONFIG_MEMCG
>         atomic_t memcgid;
> #endif
>
>         atomic_t in_swapcache;
>         struct kref refcnt;
>         atomic_t swap_count;
> };

That seems a bit large. I am assuming this is for the purpose of the
prototype and we can reduce its size eventually, right?

In particular, I remember looking into merging swap_count and refcnt,
and I am not sure what in_swapcache is (is this a single bit? Why can't
we use a bit from swap_count?). I also think we can shove the swap_type
into the low bits of the pointers (with some finesse for swp_slot_t),
and the locking could be made less granular (I remember exploring going
completely lockless, but I don't remember how that turned out).

>
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entries) from the backing
>   swapfile: simply associate the swap ID with one of the supported
>   backends: a zswap entry, a zero-filled swap page, a slot on the
>   swapfile, or a page in memory.
> * Simplify and optimize swapoff: we only have to fault the page in and
>   have the swap ID point to the page instead of the on-disk swap slot.
>   No need to perform any page table walking :)

It also allows us to delete the complex swap count continuation code.

>
> III. Future Use Cases
>
> Other than decoupling swap backends and optimizing swapoff, this new
> design allows us to implement the following more easily and
> efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8]
>   and [9]). Similar to swapoff, with the old design we would need to
>   perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>   Huang in [6]).
> * Mixed backing THP swapin (see [7]): once you have pinned down the
>   backing store of a THP's subpages, you can dispatch each range of
>   subpages to the appropriate pagein handler.
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@xxxxxxxxxxxxxx/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@xxxxxxxxxxxxx/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@xxxxxxxxxxxxxx/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@xxxxxxxxxxxxxx/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@xxxxxxxxxx/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@xxxxxxxxxxxxxx/
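
P.S. To make the "seems a bit large" comment above a bit more concrete,
the rough shape I had in mind is something like the sketch below.
Again, made-up layout and field names on my side, not the prototype
(and I'm leaving aside the vswap back-pointer and whether it's needed):
fold the kref and the in_swapcache state into swap_count, drop the
per-entry rwlock, and tag the backend type into the low bits of a
single word, which as mentioned needs some finesse for swp_slot_t.

struct swp_desc {
        /*
         * Backend pointer (struct folio *, struct zswap_entry *) or
         * swp_slot_t value, with the swap_type packed into the low
         * bits -- assumes all backends are at least word-aligned.
         */
        unsigned long backing;
        /*
         * Swap count in the low bits; one reserved high bit doubles
         * as the "in swap cache" flag and as the reference the swap
         * cache holds, so no separate kref is needed.
         */
        atomic_t swap_count;
#ifdef CONFIG_MEMCG
        atomic_t memcgid;
#endif
        struct rcu_head rcu;
};

That would be roughly 32 bytes on 64-bit, which feels a lot more
palatable if we end up allocating one of these per swapped-out page.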