My apologies if I missed any interested party in the cc list -
hopefully the mailing list cc's suffice :)

I'd like to (re-)propose the topic of a swap abstraction layer for
the conference, as a continuation of Yosry's proposals at LSFMMBPF
2023 (see [1], [2], [3]). (AFAICT, the same idea has been floated by
Rik van Riel since at least 2011 - see [8].)

I have a working(-ish) prototype, which hopefully will be
submission-ready soon. For now, I'd like to give the motivation and
context for the topic, as well as some high level design:

I. Motivation

Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer
to the original page. This slot is also used as the "key" to find the
swapped-out content, as well as the index into swap data structures,
such as the swap cache or the swap cgroup mapping. Tying a swap entry
to its backing slot in this way is performant and efficient when swap
is purely disk space, and swapoff is rare.

However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a
physical slot in the swap space even for pages that are NEVER
expected to hit the disk: pages compressed and stored in the zswap
pool, zero-filled pages, or pages rejected by both of these
optimizations when zswap writeback is disabled. This is arguably the
central shortcoming of zswap:

* In deployments where no disk space can be afforded for swap (such
  as mobile and embedded devices), users cannot adopt zswap, and are
  forced to use zram. This is confusing for users, and creates extra
  burdens for developers, who have to develop and maintain similar
  features for two separate swap backends (writeback, cgroup
  charging, THP support, etc.). For instance, see the discussion in
  [4].

* Resource-wise, it is hugely wasteful in terms of disk usage, and
  limits the memory-saving potential of these optimizations to the
  static size of the swapfile, especially on high-memory systems that
  can have up to terabytes worth of memory. It also creates
  significant challenges for users who rely on swap utilization as an
  early OOM signal.

Another motivation for a swap redesign is to simplify swapoff, which
is complicated and expensive in the current design. The tight
coupling between a swap entry and its backing storage means that
swapoff requires a whole page table walk to update all the page table
entries that refer to this swap entry, as well as updates to all the
associated swap data structures (swap cache, etc.).

II. High Level Design Overview

To fix the aforementioned issues, we need an abstraction that
separates a swap entry from its physical backing storage. IOW, we
need to "virtualize" the swap space: swap clients will work with a
virtual swap slot (that is dynamically allocated on demand), storing
it in page table entries, and using it to index into the various
swap-related data structures. The backing storage is decoupled from
this slot, and the newly introduced layer will "resolve" the ID to
the actual storage, as well as cooperate with the swap cache to
handle all the required synchronization.
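To illustrate the indirection (this is just a sketch with made-up
names - vswap_map, vswap_alloc(), vswap_find() - and assumes an
XArray-based mapping, not necessarily what the prototype does): the
layer could hand out virtual slot IDs from an allocating XArray,
keyed by the ID that swap clients store in their PTEs:

/* Sketch only - hypothetical names, assuming an XArray mapping. */
#include <linux/xarray.h>

static DEFINE_XARRAY_ALLOC(vswap_map); /* virtual slot ID -> desc */

/*
 * Hand out a fresh virtual swap slot on demand. The returned ID is
 * what swap clients store in page table entries and use to index the
 * swap cache, swap cgroup mapping, etc. It says nothing about where
 * the content physically resides.
 */
static int vswap_alloc(struct swp_desc *desc, u32 *id)
{
	return xa_alloc(&vswap_map, id, desc, xa_limit_32b, GFP_KERNEL);
}

/* Resolve a virtual slot ID back to its descriptor. */
static struct swp_desc *vswap_find(u32 id)
{
	return xa_load(&vswap_map, id);
}

The per-entry descriptor (struct swp_desc) that such an ID would
resolve to is described next.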
This layer also manages other metadata of the swap entry, such as its
lifetime information (swap count), via a dynamically allocated
per-entry swap descriptor:

struct swp_desc {
	swp_entry_t vswap;
	union {
		swp_slot_t slot;
		struct folio *folio;
		struct zswap_entry *zswap_entry;
	};
	struct rcu_head rcu;

	rwlock_t lock;
	enum swap_type type;

#ifdef CONFIG_MEMCG
	atomic_t memcgid;
#endif

	atomic_t in_swapcache;
	struct kref refcnt;
	atomic_t swap_count;
};

This design allows us to:

* Decouple zswap (and zero-mapped swap entries) from the backing
  swapfile: simply associate the swap ID with one of the supported
  backends: a zswap entry, a zero-filled swap page, a slot on the
  swapfile, or a page in memory.

* Simplify and optimize swapoff: we only have to fault the page in
  and have the swap ID point to the page instead of the on-disk swap
  slot. No need to perform any page table walking :) (A rough sketch
  of this follows the reference list below.)

III. Future Use Cases

Other than decoupling swap backends and optimizing swapoff, this new
design allows us to implement the following more easily and
efficiently:

* Multi-tier swapping (as mentioned in [5]), with transparent
  transferring (promotion/demotion) of pages across tiers (see [8]
  and [9]). Similar to swapoff, with the old design we would need to
  perform an expensive page table walk.

* Swapfile compaction to alleviate fragmentation (as proposed by Ying
  Huang in [6]).

* Mixed-backing THP swapin (see [7]): once the backing store of each
  subpage of a THP has been pinned down, each range of subpages can
  be dispatched to the appropriate swapin handler.

[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@xxxxxxxxxxxxxx/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@xxxxxxxxxxxxx/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@xxxxxxxxxxxxxx/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@xxxxxxxxxxxxxx/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@xxxxxxxxxx/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@xxxxxxxxxxxxxx/
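As promised above, here is a rough sketch of the simplified swapoff
path, written against the swp_desc layout shown in section II (the
helper name and the VSWAP_FOLIO backend tag are hypothetical, not
the actual prototype code). For each swapped-out entry, we bring the
content into memory and simply repoint its descriptor at the folio -
the PTEs, which hold the virtual slot, never need to change:

/* Sketch only - hypothetical names, not the actual prototype code. */
static void vswap_swapoff_one(struct swp_desc *desc,
			      struct folio *folio)
{
	/*
	 * Repoint the descriptor's backing store from the on-disk
	 * slot to the in-memory folio. The PTEs still hold the same
	 * virtual slot (desc->vswap), so no page table walk is
	 * required: the next swapin resolves the slot through the
	 * descriptor and finds the folio directly.
	 */
	write_lock(&desc->lock);
	desc->type = VSWAP_FOLIO;  /* hypothetical backend tag */
	desc->folio = folio;       /* replaces the swp_slot_t member */
	write_unlock(&desc->lock);
}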