On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> My apologies if I missed any interested party in the cc list -
> hopefully the mailing list cc's suffice :)
>
> I'd like to (re-)propose the topic of a swap abstraction layer for the
> conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> (see [1], [2], [3]).
>
> (AFAICT, the same idea has been floated by Rik van Riel since at
> least 2011 - see [8]).
>
> I have a working(-ish) prototype, which hopefully will be
> submission-ready soon. For now, I'd like to give the motivation/context
> for the topic, as well as some high level design:

I would obviously be interested in attending this, albeit virtually if
possible. Just sharing some random thoughts below from my cold cache.

>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index into swap data structures,
> such as the swap cache or the swap cgroup mapping. Tying a swap entry
> to its backing slot in this way is performant and efficient when swap
> is purely disk space and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a
> physical slot in the swap space, even for pages that are NEVER
> expected to hit the disk: pages compressed and stored in the zswap
> pool, zero-filled pages, or pages rejected by both of these
> optimizations when zswap writeback is disabled. This is arguably the
> central shortcoming of zswap:
> * In deployments where no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap and are
>   forced to use zram. This is confusing for users, and creates extra
>   burdens for developers, who have to develop and maintain similar
>   features for two separate swap backends (writeback, cgroup charging,
>   THP support, etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and it
>   limits the memory saving potential of these optimizations to the
>   static size of the swapfile, especially on high memory systems that
>   can have up to terabytes worth of memory. It also creates
>   significant challenges for users who rely on swap utilization as an
>   early OOM signal.
>
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. The tight coupling
> between a swap entry and its backing storage means that swapoff
> requires a whole page table walk to update all the page table entries
> that refer to this swap entry, as well as updating all the associated
> swap data structures (swap cache, etc.).
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that
> separates a swap entry from its physical backing storage. IOW, we need
> to "virtualize" the swap space: swap clients will work with a virtual
> swap slot (that is dynamically allocated on-demand), storing it in
> page table entries, and using it to index into various swap-related
> data structures.
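For anyone trying to picture the indirection, my mental model is
roughly the sketch below. To be clear, this is not from Nhat's
prototype -- the names (vswap_map, vswap_alloc, vswap_resolve) and the
choice of an allocating xarray are mine, purely to illustrate the idea
that the PTE holds a stable virtual slot and only the per-entry
descriptor knows where the data currently lives:

/*
 * Illustrative only; uses <linux/xarray.h> and the swp_entry_t
 * helpers from <linux/swapops.h>. Slot 0 is left unused.
 */
static DEFINE_XARRAY_ALLOC1(vswap_map);	/* virtual slot -> swp_desc */

/* Allocate a virtual slot on demand at swapout time. */
static int vswap_alloc(struct swp_desc *desc, swp_entry_t *vswap)
{
        u32 id;
        int err;

        err = xa_alloc(&vswap_map, &id, desc, xa_limit_32b, GFP_KERNEL);
        if (err)
                return err;

        /* A single "virtual" swap type; this value goes into the PTE. */
        *vswap = swp_entry(0, id);
        return 0;
}

/* Resolve a virtual slot (e.g. at swapin) to its current backing. */
static struct swp_desc *vswap_resolve(swp_entry_t vswap)
{
        return xa_load(&vswap_map, swp_offset(vswap));
}

With something like this, moving a page between backends (zswap to
disk, or faulting it back in for swapoff) only needs to update the
descriptor, not the PTEs or the swap cache indexing.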
>
> The backing storage is decoupled from this slot, and the newly
> introduced layer will "resolve" the ID to the actual storage, as well
> as cooperating with the swap cache to handle all the required
> synchronization. This layer also manages other metadata of the swap
> entry, such as its lifetime information (swap count), via a
> dynamically allocated per-entry swap descriptor:

Do you plan to allocate one per-folio or per-page? I suppose it's
per-page based on the design, but I am wondering if you explored having
it per-folio. To make it work we'd need to support splitting a
swp_desc, and figuring out which slot or zswap_entry corresponds to a
certain page in a folio.

>
> struct swp_desc {
>         swp_entry_t vswap;
>         union {
>                 swp_slot_t slot;
>                 struct folio *folio;
>                 struct zswap_entry *zswap_entry;
>         };
>         struct rcu_head rcu;
>
>         rwlock_t lock;
>         enum swap_type type;
>
> #ifdef CONFIG_MEMCG
>         atomic_t memcgid;
> #endif
>
>         atomic_t in_swapcache;
>         struct kref refcnt;
>         atomic_t swap_count;
> };

That seems a bit large. I am assuming this is for the purpose of the
prototype and we can reduce its size eventually, right?

In particular, I remember looking into merging swap_count and refcnt,
and I am not sure what in_swapcache is (is this a single bit? Why can't
we use a bit from swap_count?). I also think we can shove the swap_type
into the low bits of the pointers (with some finesse for swp_slot_t),
and the locking could be made less granular (I remember exploring going
completely lockless, but I don't remember how that turned out).

>
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entries) from the backing
>   swapfile: simply associate the swap ID with one of the supported
>   backends: a zswap entry, a zero-filled swap page, a slot on the
>   swapfile, or a page in memory.
> * Simplify and optimize swapoff: we only have to fault the page in and
>   have the swap ID point to the page instead of the on-disk swap slot.
>   No need to perform any page table walking :)

It also allows us to delete the complex swap count continuation code.

>
> III. Future Use Cases
>
> Other than decoupling swap backends and optimizing swapoff, this new
> design allows us to implement the following more easily and
> efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8]
>   and [9]). Similar to swapoff, with the old design we would need to
>   perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>   Huang in [6]).
> * Mixed backing THP swapin (see [7]): once you have pinned down the
>   backing store of a THP's subpages, you can dispatch each range of
>   subpages to the appropriate pagein handler.
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@xxxxxxxxxxxxxx/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@xxxxxxxxxxxxx/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@xxxxxxxxxxxxxx/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@xxxxxxxxxxxxxx/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@xxxxxxxxxx/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@xxxxxxxxxxxxxx/
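
P.S. To make the "seems a bit large" comment above a bit more concrete,
the rough shape I had in mind is something like the sketch below.
Again, made-up layout and field names on my side, not the prototype
(and I'm leaving aside the vswap back-pointer and whether it's needed):
fold the kref and the in_swapcache state into swap_count, drop the
per-entry rwlock, and tag the backend type into the low bits of a
single word, which as mentioned needs some finesse for swp_slot_t.

struct swp_desc {
        /*
         * Backend pointer (struct folio *, struct zswap_entry *) or
         * swp_slot_t value, with the swap_type packed into the low
         * bits -- assumes all backends are at least word-aligned.
         */
        unsigned long backing;
        /*
         * Swap count in the low bits; one reserved high bit doubles
         * as the "in swap cache" flag and as the reference the swap
         * cache holds, so no separate kref is needed.
         */
        atomic_t swap_count;
#ifdef CONFIG_MEMCG
        atomic_t memcgid;
#endif
        struct rcu_head rcu;
};

That would be roughly 32 bytes on 64-bit, which feels a lot more
palatable if we end up allocating one of these per swapped-out page.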