On Thu, Jan 16, 2025 at 6:47 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Fri, Jan 17, 2025 at 1:48 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > >
> > > My apologies if I missed any interested party in the cc list -
> > > hopefully the mailing list cc's suffice :)
> > >
> > > I'd like to (re-)propose the topic of a swap abstraction layer for the
> > > conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> > > (see [1], [2], [3]).
> > >
> > > (AFAICT, the same idea has been floated by Rik van Riel since at
> > > least 2011 - see [8]).
> > >
> > > I have a working(-ish) prototype, which hopefully will be
> > > submission-ready soon. For now, I'd like to give the motivation/context
> > > for the topic, as well as some high level design:
> >
> > I would obviously be interested in attending this, albeit virtually if
> > possible. Just sharing some random thoughts below from my cold cache.
>
> Your inputs are always appreciated :)
>
> >
> > > I. Motivation
> > >
> > > Currently, when an anon page is swapped out, a slot in a backing swap
> > > device is allocated and stored in the page table entries that refer to
> > > the original page. This slot is also used as the "key" to find the
> > > swapped out content, as well as the index into swap data structures,
> > > such as the swap cache, or the swap cgroup mapping. Tying a swap entry
> > > to its backing slot in this way is performant and efficient when swap
> > > is just disk space, and swapoff is rare.
> > >
> > > However, the advent of many swap optimizations has exposed major
> > > drawbacks of this design. The first problem is that we occupy a physical
> > > slot in the swap space, even for pages that are NEVER expected to hit
> > > the disk: pages compressed and stored in the zswap pool, zero-filled
> > > pages, or pages rejected by both of these optimizations when zswap
> > > writeback is disabled. This is arguably the central shortcoming of
> > > zswap:
> > > * In deployments where no disk space can be afforded for swap (such as
> > >   mobile and embedded devices), users cannot adopt zswap, and are forced
> > >   to use zram. This is confusing for users, and creates extra burden
> > >   for developers, who have to develop and maintain similar features for
> > >   two separate swap backends (writeback, cgroup charging, THP support,
> > >   etc.). For instance, see the discussion in [4].
> > > * Resource-wise, it is hugely wasteful in terms of disk usage, and
> > >   limits the memory saving potential of these optimizations to the
> > >   static size of the swapfile, especially in high memory systems that
> > >   can have up to terabytes worth of memory. It also creates significant
> > >   challenges for users who rely on swap utilization as an early OOM
> > >   signal.
> > >
> > > Another motivation for a swap redesign is to simplify swapoff, which
> > > is complicated and expensive in the current design. Tight coupling
> > > between a swap entry and its backing storage means that swapoff
> > > requires a whole page table walk to update all the page table entries
> > > that refer to each swap entry, as well as updating all the associated
> > > swap data structures (swap cache, etc.).
> > >
> > >
> > > II. High Level Design Overview
> > >
> > > To fix the aforementioned issues, we need an abstraction that separates
> > > a swap entry from its physical backing storage. IOW, we need to
> > > “virtualize” the swap space: swap clients will work with a virtual swap
> > > slot (that is dynamically allocated on-demand), storing it in page
> > > table entries, and using it to index into various swap-related data
> > > structures.
> > >
> > > The backing storage is decoupled from this slot, and the newly
> > > introduced layer will “resolve” the ID to the actual storage, as well
> > > as cooperate with the swap cache to handle all the required
> > > synchronization. This layer also manages other metadata of the swap
> > > entry, such as its lifetime information (swap count), via a dynamically
> > > allocated per-entry swap descriptor:
> >
> > Do you plan to allocate one per-folio or per-page? I suppose it's
> > per-page based on the design, but I am wondering if you explored
> > having it per-folio. To make it work we'd need to support splitting a
> > swp_desc, and figuring out which slot or zswap_entry corresponds to a
> > certain page in a folio.
>
> Per-page, for now. Per-folio requires allocating these swp_descs on
> huge page splitting etc., which is more complex.

We'd also need to allocate them during swapin. If a folio is swapped
out as a 16K chunk with a single swp_desc, and we then try to swap in
a single 4K page in the middle, we may need to split the swp_desc into
two.

>
> And yeah, we need to chain these zswap_entry's somehow. Not impossible
> certainly, but more overhead and more complexity :)
>
> > >
> > >
> > > struct swp_desc {
> > >         swp_entry_t vswap;
> > >         union {
> > >                 swp_slot_t slot;
> > >                 struct folio *folio;
> > >                 struct zswap_entry *zswap_entry;
> > >         };
> > >         struct rcu_head rcu;
> > >
> > >         rwlock_t lock;
> > >         enum swap_type type;
> > >
> > > #ifdef CONFIG_MEMCG
> > >         atomic_t memcgid;
> > > #endif
> > >
> > >         atomic_t in_swapcache;
> > >         struct kref refcnt;
> > >         atomic_t swap_count;
> > > };
> >
> > That seems a bit large. I am assuming this is for the purpose of the
> > prototype and we can reduce its size eventually, right?
>
> Yup. I copied and pasted this from the prototype. Originally I
> squeezed all the state (in_swapcache and the swap type) into an
> integer-type "flag" field + 1 separate swap count field, and protected
> them all with a single rw lock. That gets really ugly/confusing, so
> for the sake of the prototype I just separated them all out into their
> own fields, and play with atomicity to see if it's possible to do
> things locklessly. So far so good (i.e., no crashes yet), but the final
> form is TBD :) Maybe we can discuss this in closer detail once I send
> out this prototype as an RFC?

Yeah, I just had some passing comments.

>
> (I will say, though, it looks cleaner when all these fields are
> separated. So it's going to be a tradeoff in that sense too).

It's a tradeoff, but I think we should be able to hide a lot of the
complexity behind neat helpers. It's not pretty, but I think the memory
overhead is an important factor here.

> >
> > Particularly, I remember looking into merging the swap_count and
> > refcnt, and I am not sure what in_swapcache is (is this a bit? Why
> > can't we use a bit from swap_count?).
>
> Yup. That's a single bit - it's a (partial) replacement for
> SWAP_HAS_CACHE state in the existing swap map.
>
> No particular reason why we can't squeeze it into the swap count other
> than clarity :) It's going to be a bit annoying working with swap
> count values (swap count increment is now * 2 instead of ++ etc.).

Nothing a nice helper cannot hide :)
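
To make the helper idea concrete, below is a rough sketch of what such
helpers could look like if the cache bit lived in bit 0 of swap_count.
To be clear, this is just an illustration, not code from your
prototype; the names and the exact encoding are made up:

/*
 * Illustrative sketch only; not from the prototype. The "in swap cache"
 * state lives in bit 0 of swap_count, the actual count in the remaining
 * bits, so callers never see the "* 2 instead of ++" math directly.
 */
#define VSWAP_CACHE_BIT         1
#define VSWAP_COUNT_UNIT        2

static inline int vswap_count(struct swp_desc *desc)
{
        return atomic_read(&desc->swap_count) >> 1;
}

static inline bool vswap_in_swapcache(struct swp_desc *desc)
{
        return atomic_read(&desc->swap_count) & VSWAP_CACHE_BIT;
}

static inline void vswap_count_inc(struct swp_desc *desc)
{
        atomic_add(VSWAP_COUNT_UNIT, &desc->swap_count);
}

static inline void vswap_count_dec(struct swp_desc *desc)
{
        atomic_sub(VSWAP_COUNT_UNIT, &desc->swap_count);
}

/* Returns true if we won the race to claim the swap cache bit. */
static inline bool vswap_try_set_swapcache(struct swp_desc *desc)
{
        return !(atomic_fetch_or(VSWAP_CACHE_BIT, &desc->swap_count) &
                 VSWAP_CACHE_BIT);
}

The cost is one less bit of swap count range, which is a similar
tradeoff to how SWAP_HAS_CACHE already shares the per-slot swap_map
byte with the count today.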
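
Also, stepping back to the design overview for a second, in case it
helps frame the discussion at the conference: the way I picture the
"resolution" step is roughly the sketch below. Again, purely
illustrative; the xarray, the swap type value and the helper names are
hypothetical, not taken from the prototype:

/*
 * Hypothetical sketch: map a virtual swap slot to its descriptor, and
 * from there to whatever currently backs it (physical slot, zswap
 * entry, in-memory folio, ...).
 */
static DEFINE_XARRAY(vswap_descs);      /* vswap.val -> struct swp_desc * */

static struct swp_desc *vswap_to_desc(swp_entry_t vswap)
{
        return xa_load(&vswap_descs, vswap.val);
}

/*
 * Resolve a virtual slot to the physical swap slot backing it, or a
 * zero slot if the entry currently lives in zswap, is zero-filled, etc.
 */
static swp_slot_t vswap_to_slot(swp_entry_t vswap)
{
        struct swp_desc *desc = vswap_to_desc(vswap);
        swp_slot_t slot = { 0 };

        if (!desc)
                return slot;

        read_lock(&desc->lock);
        if (desc->type == VSWAP_SWAPFILE)       /* hypothetical swap_type value */
                slot = desc->slot;
        read_unlock(&desc->lock);

        return slot;
}

Page table entries and the swap cache would only ever see the vswap
value; only this layer knows (and is allowed to change) what is behind
it, which is what should make things like swapoff much cheaper: only
the descriptor needs to change, not every PTE.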