My apologies if I missed any interested party in the cc list -
hopefully the mailing list cc's suffice :)

I'd like to (re-)propose the topic of a swap abstraction layer for
the conference, as a continuation of Yosry's proposals at LSFMMBPF
2023 (see [1], [2], [3]). (AFAICT, the same idea has been floated by
Rik van Riel since at least 2011 - see [8].)

I have a working(-ish) prototype, which hopefully will be
submission-ready soon. For now, I'd like to give the motivation and
context for the topic, as well as some high level design:

I. Motivation

Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer
to the original page. This slot is also used as the "key" to find the
swapped-out content, as well as the index into swap data structures,
such as the swap cache or the swap cgroup mapping. Tying a swap entry
to its backing slot in this way is performant and efficient when swap
is purely disk space, and swapoff is rare.

However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a
physical slot in the swap space even for pages that are NEVER
expected to hit the disk: pages compressed and stored in the zswap
pool, zero-filled pages, or pages rejected by both of these
optimizations when zswap writeback is disabled. This is arguably the
central shortcoming of zswap:

* In deployments where no disk space can be afforded for swap (such
  as mobile and embedded devices), users cannot adopt zswap, and are
  forced to use zram. This is confusing for users, and creates extra
  burdens for developers, who have to develop and maintain similar
  features for two separate swap backends (writeback, cgroup
  charging, THP support, etc.). For instance, see the discussion in
  [4].

* Resource-wise, it is hugely wasteful in terms of disk usage, and
  limits the memory-saving potential of these optimizations to the
  static size of the swapfile, especially on high-memory systems that
  can have up to terabytes worth of memory. It also creates
  significant challenges for users who rely on swap utilization as an
  early OOM signal.

Another motivation for a swap redesign is to simplify swapoff, which
is complicated and expensive in the current design. The tight
coupling between a swap entry and its backing storage means that
swapoff requires a whole page table walk to update all the page table
entries that refer to this swap entry, as well as updates to all the
associated swap data structures (swap cache, etc.).

II. High Level Design Overview

To fix the aforementioned issues, we need an abstraction that
separates a swap entry from its physical backing storage. IOW, we
need to "virtualize" the swap space: swap clients will work with a
virtual swap slot (that is dynamically allocated on demand), storing
it in page table entries, and using it to index into the various
swap-related data structures. The backing storage is decoupled from
this slot, and the newly introduced layer will "resolve" the ID to
the actual storage, as well as cooperate with the swap cache to
handle all the required synchronization.
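To illustrate the indirection (this is just a sketch with made-up
names - vswap_map, vswap_alloc(), vswap_find() - and assumes an
XArray-based mapping, not necessarily what the prototype does): the
layer could hand out virtual slot IDs from an allocating XArray,
keyed by the ID that swap clients store in their PTEs:

/* Sketch only - hypothetical names, assuming an XArray mapping. */
#include <linux/xarray.h>

static DEFINE_XARRAY_ALLOC(vswap_map); /* virtual slot ID -> desc */

/*
 * Hand out a fresh virtual swap slot on demand. The returned ID is
 * what swap clients store in page table entries and use to index the
 * swap cache, swap cgroup mapping, etc. It says nothing about where
 * the content physically resides.
 */
static int vswap_alloc(struct swp_desc *desc, u32 *id)
{
	return xa_alloc(&vswap_map, id, desc, xa_limit_32b, GFP_KERNEL);
}

/* Resolve a virtual slot ID back to its descriptor. */
static struct swp_desc *vswap_find(u32 id)
{
	return xa_load(&vswap_map, id);
}

The per-entry descriptor (struct swp_desc) that such an ID would
resolve to is described next.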
This layer also manages other metadata of the swap entry, such as its
lifetime information (swap count), via a dynamically allocated
per-entry swap descriptor:

struct swp_desc {
	swp_entry_t vswap;
	union {
		swp_slot_t slot;
		struct folio *folio;
		struct zswap_entry *zswap_entry;
	};
	struct rcu_head rcu;

	rwlock_t lock;
	enum swap_type type;

#ifdef CONFIG_MEMCG
	atomic_t memcgid;
#endif

	atomic_t in_swapcache;
	struct kref refcnt;
	atomic_t swap_count;
};

This design allows us to:

* Decouple zswap (and zero-mapped swap entries) from the backing
  swapfile: simply associate the swap ID with one of the supported
  backends: a zswap entry, a zero-filled swap page, a slot on the
  swapfile, or a page in memory.

* Simplify and optimize swapoff: we only have to fault the page in
  and have the swap ID point to the page instead of the on-disk swap
  slot. No need to perform any page table walking :) (A rough sketch
  of this follows the reference list below.)

III. Future Use Cases

Other than decoupling swap backends and optimizing swapoff, this new
design allows us to implement the following more easily and
efficiently:

* Multi-tier swapping (as mentioned in [5]), with transparent
  transferring (promotion/demotion) of pages across tiers (see [8]
  and [9]). Similar to swapoff, with the old design we would need to
  perform an expensive page table walk.

* Swapfile compaction to alleviate fragmentation (as proposed by Ying
  Huang in [6]).

* Mixed-backing THP swapin (see [7]): once the backing store of each
  subpage of a THP has been pinned down, each range of subpages can
  be dispatched to the appropriate swapin handler.

[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@xxxxxxxxxxxxxx/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@xxxxxxxxxxxxx/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@xxxxxxxxxxxxxx/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@xxxxxxxxxxxxxx/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@xxxxxxxxxx/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@xxxxxxxxxxxxxx/
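As promised above, here is a rough sketch of the simplified swapoff
path, written against the swp_desc layout shown in section II (the
helper name and the VSWAP_FOLIO backend tag are hypothetical, not
the actual prototype code). For each swapped-out entry, we bring the
content into memory and simply repoint its descriptor at the folio -
the PTEs, which hold the virtual slot, never need to change:

/* Sketch only - hypothetical names, not the actual prototype code. */
static void vswap_swapoff_one(struct swp_desc *desc,
			      struct folio *folio)
{
	/*
	 * Repoint the descriptor's backing store from the on-disk
	 * slot to the in-memory folio. The PTEs still hold the same
	 * virtual slot (desc->vswap), so no page table walk is
	 * required: the next swapin resolves the slot through the
	 * descriptor and finds the folio directly.
	 */
	write_lock(&desc->lock);
	desc->type = VSWAP_FOLIO;  /* hypothetical backend tag */
	desc->folio = folio;       /* replaces the swp_slot_t member */
	write_unlock(&desc->lock);
}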