On Wed, Mar 1, 2023 at 4:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Tue, Feb 28, 2023 at 3:29 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >
> > Hi Yosry,
> >
> > On Tue, Feb 28, 2023 at 12:12:05AM -0800, Yosry Ahmed wrote:
> > > On Mon, Feb 27, 2023 at 8:54 PM Sergey Senozhatsky <senozhatsky@xxxxxxxxxxxx> wrote:
> > > >
> > > > On (23/02/18 14:38), Yosry Ahmed wrote:
> > > > [..]
> > > > > ==================== Idea ====================
> > > > > Introduce a data structure, which I currently call a swap_desc, as an abstraction layer between the swapping implementation and the rest of the MM code. Page tables & page caches would store a swap id (encoded as a swp_entry_t) instead of directly storing the swap entry associated with the swapfile. This swap id maps to a struct swap_desc, which acts as our abstraction layer. All MM code not concerned with swapping details would operate in terms of swap descs. The swap_desc can point to either a normal swap entry (associated with a swapfile) or a zswap entry. It can also include all non-backend-specific operations, such as the swapcache (which would be a simple pointer in swap_desc), swap counting, etc. It creates a clear, nice abstraction layer between MM code and the actual swapping implementation.
> > > > >
> > > > > ==================== Benefits ====================
> > > > > This work enables using zswap without a backing swapfile and increases the swap capacity when zswap is used with a swapfile. It also creates a separation that allows us to skip code paths that don't make sense in the zswap path (e.g. readahead). We get to drop zswap's rbtree, which might result in better performance (fewer lookups, less lock contention).
> > > > >
> > > > > The abstraction layer also opens the door for multiple cleanups (e.g. removing swapper address spaces, removing swap count continuation code, etc.). Another nice cleanup that this work enables would be separating the overloaded swp_entry_t into two distinct types: one for things that are stored in page tables / caches, and one for actual swap entries. In the future, we can potentially further optimize how we use the bits in the page tables instead of sticking everything into the current type/offset format.
> > > > >
> > > > > Another potential win here is swapoff, which can be made more practical by directly scanning all swap_descs instead of going through page tables and shmem page caches.
> > > > >
> > > > > Overall, zswap becomes more accessible and available to a wider range of use cases.
> > > >
> > > > I assume this also brings us closer to proper writeback LRU handling?
> > >
> > > I assume by proper LRU handling you mean:
> > > - Swap writeback LRU that lives outside of the zpool backends (i.e. in zswap itself or even outside zswap).
> >
> > Even outside zswap, to support any combination on any heterogeneous multiple swap device configuration.
>
> Agreed, this is the end goal for the writeback LRU.
>
> > The indirection layer would be essential to support it, but it would also be great if we don't waste any memory for users who don't want the feature.
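Side note, for anyone who did not see the original posting: here is a rough sketch of what I am picturing for swap_desc. Field names and layout are purely illustrative and will almost certainly change in the proof-of-concept:

	/*
	 * Rough sketch only: field names and layout are illustrative,
	 * not taken from the actual proof-of-concept.
	 */
	struct swap_desc {
		union {
			swp_entry_t swap_entry;		 /* slot in a real swapfile */
			struct zswap_entry *zswap_entry; /* compressed copy, zswap only */
		};
		struct folio *swapcache;	/* swapcache becomes a simple pointer */
		unsigned int swap_count;	/* backend-agnostic swap counting */
		unsigned long flags;		/* e.g. which union member is valid */
	};

The point is just that everything outside of the swapping code only ever sees the swap id and the swap_desc; whether the data actually sits in a swapfile or in zswap stays hidden behind it.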
>
> I can't currently think of a way to eliminate the overhead for people only using swapfiles, as a lot of the core implementation changes, unless we want to maintain considerably more code with a lot of repeated functionality implemented differently. Perhaps this will change as I implement it; maybe things are better (or worse) than I think they are. I am actively working on a proof-of-concept right now. Maybe a discussion at LSF/MM/BPF will help come up with optimizations as well :)
>
> > Just FYI, there was a similar discussion about the indirection layer a long time ago:
> > https://lore.kernel.org/linux-mm/4DA25039.3020700@xxxxxxxxxx/
>
> Yeah, Hugh shared this one with me earlier, but there are a few things that I don't understand how they would work, at least in today's world.
>
> Firstly, the proposal suggests that we store a radix tree index in the page tables, and in the radix tree store the swap entry AND the swap count. I am not really sure how they would fit in 8 bytes, especially if we need continuation and 1 byte is not enough for the swap count. Continuation logic now depends on linking vmalloc'd pages using the lru field in struct page/folio. Perhaps we can figure out a split that gives enough space for the swap count without continuation while also not limiting swapfile sizes too much.
>
> Secondly, IIUC in that proposal once we swap a page in, we free the swap entry and add the swapcache page to the radix tree instead. In that case, where does the swap count go? IIUC we still need to maintain it to be able to tell when all processes mapping the page have faulted it back; otherwise the radix tree entry is maintained indefinitely. We can maybe stash the swap count somewhere else in this case and bring it back to the radix tree if we swap the page out again. I am not really sure where: we could have a separate radix tree for swap counts only while the page is in the swapcache, or we could always keep it in a separate radix tree so that the swap entry fits comfortably in the first radix tree.
>
> To be able to accommodate zswap in this design, I think we always need a separate radix tree for swap counts. In that case, one radix tree contains swap_entry/zswap_entry/swapcache, and the other radix tree contains the swap count. I think this may work, but I am not sure if the overhead of always doing a lookup to read the swap count is okay. I am also sure there would be some fun synchronization problems between both trees (but we already need to synchronize today between the swapcache and swap counts?).
>
> It sounds like it is possible to make it work. I will spend some time thinking about it. Having two radix trees also solves the 32-bit systems problem, but I am not sure if it's a generally better design. Radix trees also take up some extra space beyond the entry size itself, so I am not sure how much memory we would end up actually saving.
>
> Johannes, I am curious if you have any thoughts about this alternative design?

I completely forgot about shadow entries here. I don't think any of this works with shadow entries, as we still need to maintain them while the page is swapped out.
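To make the two-tree idea above a bit more concrete, here is a purely illustrative sketch; the names are made up, it uses xarrays (the modern radix tree interface), and it ignores locking, 32-bit, and exactly the shadow entry problem I just mentioned:

	#include <linux/xarray.h>

	/* Indexed by the id stored in page tables: a swapfile entry, a zswap
	 * entry, or the swapcache folio, depending on where the page lives. */
	static DEFINE_XARRAY(swap_backing);

	/* Indexed by the same id: the swap count, kept separately so it
	 * survives the page moving in and out of the swapcache. */
	static DEFINE_XARRAY(swap_counts);

	/* Counts would be stored as xarray value entries; returns 0 if none. */
	static unsigned long swap_count_of(unsigned long id)
	{
		return xa_to_value(xa_load(&swap_counts, id));
	}

Every read of the count is then a second tree walk, which is the lookup overhead concern I mentioned above.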