On Wed, Feb 5, 2025 at 2:11 AM Yosry Ahmed <yosry.ahmed@xxxxxxxxx> wrote:
>
> On Wed, Feb 05, 2025 at 12:46:26AM +0800, Kairui Song wrote:
> > On Wed, Feb 5, 2025 at 12:24 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > >
> > > Hi Kairui,
> > >
> > > On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> > > > Hi all, sorry for the late submission.
> > > >
> > > > Following previous work and topics with the SWAP allocator
> > > > [1][2][3][4], this topic would propose a way to redesign and integrate
> > > > multiple kinds of swap data into the swap allocator, which should be a
> > > > future-proof design, achieving the following benefits:
> > > > - Even lower memory usage than the current design
> > > > - Higher performance (Remove HAS_CACHE pin trampoline)
> > > > - Dynamic allocation and growth support, further reducing idle memory usage
> > > > - Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
> > > > - More extensible, providing a clean bedrock for implementing things
> > > > like discontinuous swapout, readahead-based mTHP swapin and more.
> > > >
> > > > People have been complaining about the SWAP management subsystem [5].
> > > > Many incremental workarounds and optimizations have been added, but
> > > > they cause many other problems, e.g. [6][7][8][9], and make
> > > > implementing new features more difficult. One reason is that the
> > > > current design already has close to minimal memory usage (a 1-byte
> > > > swap map) with acceptable performance, so it's hard to beat with
> > > > incremental changes. But as more code and features are added, there
> > > > are already lots of duplicated parts. So I'm proposing this idea to
> > > > overhaul the whole SWAP slot management from a different angle, as
> > > > follow-up work on the SWAP allocator [2].
> > > >
> > > > Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> > > > unifying swap data; we worked together to implement the short-term
> > > > solution first: the swap allocator was the bottleneck for performance
> > > > and fragmentation issues. The new cluster allocator solved these
> > > > issues, and turned the cluster into a basic swap management unit.
> > > > It also removed the slot cache freeing path, and I'll post another
> > > > series soon to remove the slot cache allocation path, so folios will
> > > > always interact with the SWAP allocator directly, preparing for this
> > > > long-term goal:
> > > >
> > > > A brief intro of the new design
> > > > ===============================
> > > >
> > > > It will first be a drop-in replacement for the swap cache, using a
> > > > per-cluster table to handle all things required for SWAP management.
> > > > Compared to the previous attempt to unify the swap cache [11], this
> > > > will have lower overhead with more features achievable:
> > > >
> > > > struct swap_cluster_info {
> > > >         spinlock_t lock;
> > > >         u16 count;
> > > >         u8 flags;
> > > >         u8 order;
> > > > +       void *table; /* 512 entries */
> > > >         struct list_head list;
> > > > };
> > > >
> > > > The table itself can have variant formats, but for basic usage,
> > > > each void* could be one of the following types:
> > > >
> > > > /*
> > > >  * a NULL:    | ----------- 0 ------------|      - Empty slot
> > > >  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swapped out
> > > >  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
> > > >  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
> > > >  * SWAP_COUNT is still 8 bits.
> > > >  */
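
[To make the layout above concrete: a rough userspace-style C sketch of
how such an 8-byte tagged entry could be encoded. Everything below is
illustrative only; the type and helper names are made up and not from any
posted patch. The only assumptions are the low-3-bit type tag and the
8-bit SWAP_COUNT field shown in the quoted comment.]

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t swp_te_t;              /* one 8-byte table entry */

#define SWP_TE_COUNT_SHIFT      56      /* top 8 bits: SWAP_COUNT */
#define SWP_TE_COUNT_MASK       (0xffULL << SWP_TE_COUNT_SHIFT)
#define SWP_TE_VAL_MASK         (~(SWP_TE_COUNT_MASK | 0x7ULL))

/* Low bits follow the quoted layout: ...XX1, ...X10, ...100. */
static inline bool swp_te_is_empty(swp_te_t te)  { return te == 0; }
static inline bool swp_te_is_shadow(swp_te_t te) { return te & 0x1; }
static inline bool swp_te_is_pfn(swp_te_t te)    { return (te & 0x3) == 0x2; }
static inline bool swp_te_is_ptr(swp_te_t te)    { return (te & 0x7) == 0x4; }

/* Swapped-out slot: 8-bit count + shadow bits (assumed to fit bits 3..55) + tag 1. */
static inline swp_te_t swp_te_mk_shadow(uint8_t count, uint64_t shadow)
{
	return ((uint64_t)count << SWP_TE_COUNT_SHIFT) |
	       (shadow & SWP_TE_VAL_MASK) | 0x1;
}

/* Cached slot: 8-bit count + PFN of the folio in swap cache + tag 10. */
static inline swp_te_t swp_te_mk_pfn(uint8_t count, uint64_t pfn)
{
	return ((uint64_t)count << SWP_TE_COUNT_SHIFT) |
	       ((pfn << 3) & SWP_TE_VAL_MASK) | 0x2;
}

/* Pointer slot: any 8-byte-aligned pointer, its free low 3 bits reused as tag 100. */
static inline swp_te_t swp_te_mk_ptr(void *p)
{
	return (uint64_t)(uintptr_t)p | 0x4;
}

static inline uint8_t swp_te_count(swp_te_t te)
{
	return te >> SWP_TE_COUNT_SHIFT;
}

static inline uint64_t swp_te_pfn(swp_te_t te)
{
	return (te & SWP_TE_VAL_MASK) >> 3;
}
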
> > > >
> > > > Clearly it can hold both the cache and the swap count. The shadow
> > > > still has enough bits for distance (using 16M buckets for a 52-bit
> > > > VA) or generation counting. For COUNT_CONTINUED, it can simply
> > > > allocate another 512 atomics for one cluster.
> > > >
> > > > The table is protected by ci->lock, which has little to no contention.
> > > > It also gets rid of the "HAS_CACHE bit setting vs cache insert" and
> > > > "HAS_CACHE pin as trampoline" issues, deprecating SWP_SYNCHRONOUS_IO.
> > > > And it removes the "multiple smaller files in one big swapfile" design.
> > > >
> > > > It will further remove the swap cgroup map. A cached folio (stored as
> > > > a PFN) or the shadow can provide such info. Some careful audit and
> > > > workflow redesign might be needed.
> > > >
> > > > Each entry will be 8 bytes, smaller than the current (8 bytes cache)
> > > > + (2 bytes cgroup map) + (1 byte SWAP map) = 11 bytes.
> > > >
> > > > Shadow reclaim and high-order storing are still doable too, by
> > > > introducing dense cluster table formats. We can even optimize it
> > > > specially for shmem to have 1 bit per entry. And empty clusters can
> > > > have their table freed. This part might be optional.
> > > >
> > > > And it can have more types for supporting things like entry migrations
> > > > or virtual swapfiles. The example formats above show four types. The
> > > > last three (or more) bits can be used as a type indicator, since
> > > > HAS_CACHE and COUNT_CONTINUED will be gone.
> > > >
> >
> > Hi Johannes
> >
> > > My understanding is that this would still tie the swap space to
> > > configured swapfiles. That aspect of the current design has more and
> > > more turned into a problem, because we now have several categories of
> > > swap entries that either permanently or for extended periods of time
> > > live in memory. Such entries should not occupy actual disk space.
> > >
> > > The oldest one is probably partially refaulted entries (where one out
> > > of N swapped page tables faults back in). We currently have to spend
> > > full pages of both memory AND disk space for these.
> > >
> > > The newest ones are zero-filled entries which are stored in a bitmap.
> > >
> > > Then there is zswap. You mention ghost swapfiles - I know some setups
> > > do this to use zswap purely for compression. But zswap is a writeback
> > > cache for real swapfiles primarily, and it is used as such. That means
> > > entries need to be able to move from the compressed pool to disk at
> > > some point, but might not for a long time. Tying the compressed pool
> > > size to disk space is hugely wasteful and an operational headache.
> > >
> > > So I think any future-proof design for the swap allocator needs to
> > > decouple the virtual memory layer (page table count, swapcache, memcg
> > > linkage, shadow info) from the physical layer (swapfile slot).
> > >
> > > Can you touch on that concern?
> >
> > Yes, I fully understand your concern. The purpose of this swap table
> > design is to provide a base for building other parts, including
> > decoupling the virtual layer from the physical layer.
> >
> > The table entry can have different types, so a virtual file/space can
> > leverage this too. For example the virtual layer can have something
> > like a "redirection entry" pointing to a physical device layer. Or
> > just a pointer to anything that could possibly be used (in the four
> > examples I provided, one type is a pointer). A swap space will need
> > something to index its data.
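
To make that last point concrete, here is a purely illustrative sketch
(userspace-style C, all names invented for the example and not from any
patch) of the kind of thing a pointer-type "redirection" entry could
reference, so a virtual slot stays decoupled from its physical backing:

#include <stdint.h>

enum swp_backing_type {
	SWP_BACKING_NONE,	/* not materialized anywhere yet */
	SWP_BACKING_DISK,	/* slot on a real swapfile/device */
	SWP_BACKING_ZSWAP,	/* compressed copy only, no disk slot */
	SWP_BACKING_ZERO,	/* known zero-filled, no storage at all */
};

/*
 * A virtual slot's pointer-type entry could reference something like
 * this. The struct must be at least 8-byte aligned so the low 3 bits
 * of its address stay free for the table's type tag.
 */
struct swp_backing_desc {
	enum swp_backing_type type;
	union {
		uint64_t disk_slot;	/* offset on the physical device */
		void *zswap_handle;	/* handle into the compressed pool */
	};
} __attribute__((aligned(8)));

With something like this, moving an entry from the compressed pool to
disk could just rewrite the descriptor under ci->lock, without touching
page tables or the virtual slot itself.
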
> > We have already deployed a very similar solution internally for
> > multi-layer swapout, and it's working well; we expect to implement it
> > upstream and deprecate the downstream solution.
> >
> > Using an optional layer for doing so still consumes very little memory
> > (16 bytes per entry for two layers, and this might be doable with just
> > a single layer). And there are setups that don't need an extra layer;
> > such setups can ignore that part and have only 8 bytes per entry,
> > keeping the overhead very low.
>
> IIUC with this design we still have a fixed-size swap space, but it's
> not directly tied to the physical swap layer (i.e. it can be backed with
> a swap slot on disk, zswap, zero-filled pages, etc). Did I get this
> right?
>
> In this case, using clusters to manage this should be an implementation
> detail that is not visible to userspace. Ideally the kernel would
> allocate more clusters dynamically as needed, and when a swap entry is
> being allocated in that cluster the kernel chooses the backing for that
> swap entry based on the available options.
>
> I see the benefit of managing things on the cluster level to reduce
> memory overhead (e.g. one lock per cluster vs. per entry), and to
> leverage existing code where it makes sense.

Yes, agreed, a cluster-based map means we can have many empty clusters
without consuming any pre-reserved map memory. And extending the cluster
array should be doable too.

>
> However, what we should *not* do is have these clusters be tied to the
> disk swap space with the ability to redirect some entries to use
> something like zswap. This does not fix the problem Johannes is
> describing.

Yes, a virtual swap file can have its own swap space, which is indexed
by the cache / table, and reuses all the logic.

As long as we don't dramatically change the kernel swapout path, adding
a folio to the swap cache seems a very reasonable way to avoid redundant
IO and synchronize swapin/swapout, while reusing a lot of infrastructure,
even if that's a virtual file. For example, a current busy-loop issue can
be fixed just by leveraging the folio lock:
https://lore.kernel.org/lkml/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@xxxxxxxxxxxxxx/

The virtual file/space can be decoupled from the lower device. But the
virtual file/space's table entry can point to an underlying physical
SWAP device or some meta struct.
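
To close with a rough illustration of the cost side (a standalone sketch
with made-up names, using plain userspace C instead of kernel
primitives): the per-cluster table only exists while the cluster is in
use, which is where the dynamic growth and "empty clusters cost nothing"
properties come from. With 4 KiB pages, 512 entries x 8 bytes is 4 KiB of
table per 2 MiB of swap space covered by a cluster, versus roughly 11
bytes per entry today.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CLUSTER_SLOTS 512	/* 512 entries x 8 bytes = 4 KiB per cluster */

struct cluster_sketch {
	uint16_t count;		/* in-use slots, like swap_cluster_info.count */
	uint64_t *table;	/* allocated lazily, NULL while cluster is empty */
};

/* Store a non-zero entry, allocating the table on first use. */
static int cluster_store(struct cluster_sketch *ci, int slot, uint64_t te)
{
	if (!te || slot < 0 || slot >= CLUSTER_SLOTS)
		return -1;
	if (!ci->table) {
		ci->table = calloc(CLUSTER_SLOTS, sizeof(*ci->table));
		if (!ci->table)
			return -1;
	}
	if (!ci->table[slot])
		ci->count++;
	ci->table[slot] = te;
	return 0;
}

/* Clear an entry; once the cluster is empty, the whole table is freed. */
static void cluster_clear(struct cluster_sketch *ci, int slot)
{
	if (!ci->table || !ci->table[slot])
		return;
	ci->table[slot] = 0;
	if (--ci->count == 0) {
		free(ci->table);
		ci->table = NULL;
	}
}

int main(void)
{
	struct cluster_sketch ci = { 0 };

	cluster_store(&ci, 3, 0xabcd);	/* first store allocates the 4 KiB table */
	cluster_clear(&ci, 3);		/* last clear frees it again */
	printf("table after last free: %p\n", (void *)ci.table);
	return 0;
}
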