On Wed, Mar 01, 2023 at 04:58:08PM -0800, Yosry Ahmed wrote:
> > The indirection layer would be essential to support it but it would
> > be also great if we don't waste any memory for users who don't
> > want the feature.
>
> I can't currently think of a way to eliminate overhead for people only
> using swapfiles, as a lot of the core implementation changes, unless
> we want to maintain considerably more code with a lot of repeated
> functionality implemented differently. Perhaps this will change as I
> implement this, maybe things are better (or worse) than what I think
> they are, I am actively working on a proof-of-concept right now. Maybe
> a discussion in LSF/MM/BPF will help come up with optimizations as
> well :)
>
> >
> > Just FYI, there was a similar discussion a long time ago about the
> > indirection layer.
> > https://lore.kernel.org/linux-mm/4DA25039.3020700@xxxxxxxxxx/
>
> Yeah Hugh shared this one with me earlier, but there are a few things
> that I don't understand how they would work, at least in today's
> world.

Let's add Rik into the discussion, maybe he can help refresh some details.

Chris

>
> Firstly, the proposal suggests that we store a radix tree index in the
> page tables, and in the radix tree store the swap entry AND the swap
> count. I am not really sure how they would fit in 8 bytes, especially
> if we need continuation and 1 byte is not enough for the swap count.
> Continuation logic now depends on linking vmalloc'd pages using the
> lru field in struct page/folio. Perhaps we can figure out a split that
> gives enough space for swap count without continuation while also not
> limiting swapfile sizes too much.
>
> Secondly, IIUC in that proposal once we swap a page in, we free the
> swap entry and add the swapcache page to the radix tree instead. In
> that case, where does the swap count go? IIUC we still need to
> maintain it to be able to tell when all processes mapping the page
> have faulted it back, otherwise the radix tree entry is maintained
> indefinitely. We can maybe stash the swap count somewhere else in this
> case, and bring it back to the radix tree if we swap the page out
> again. Not really sure where, we can have a separate radix tree for
> swap counts when the page is in swapcache, or we can always have it in
> a separate radix tree so that the swap entry fits comfortably in the
> first radix tree.
>
> To be able to accommodate zswap in this design, I think we always need
> a separate radix tree for swap counts. In that case, one radix tree
> contains swap_entry/zswap_entry/swapcache, and the other radix tree
> contains the swap count. I think this may work, but I am not sure if
> the overhead of always doing a lookup to read the swap count is okay.
> I am also sure there would be some fun synchronization problems
> between both trees (but we already need to synchronize today between
> the swapcache and swap counts?).
>
> It sounds like it is possible to make it work. I will spend some time
> thinking about it. Having 2 radix trees also solves the 32-bit systems
> problem, but I am not sure if it's a generally better design. Radix
> trees also take up some extra space other than the entry size itself,
> so I am not sure how much memory we would end up actually saving.
>
> Johannes, I am curious if you have any thoughts about this alternative design?
>
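
To make the "fit in 8 bytes" question above a bit more concrete, here is
a rough userspace sketch of one possible split of a 64-bit slot into
swap type, swap count and swap offset. Everything in it is hypothetical:
the slot_* names and the 5/16/43 bit widths are made up for illustration
and are not taken from any posted patch or from the old proposal. It
also ignores any tag bits a real tree implementation would reserve in
each entry.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout of one 64-bit slot (widths are illustrative only):
 *
 *   [ type : 5 ][ count : 16 ][ offset : 43 ]
 *
 * A wider count field avoids continuation but shrinks the offset field,
 * which caps the size of a single swapfile.
 */
#define SLOT_TYPE_BITS   5u
#define SLOT_COUNT_BITS  16u
#define SLOT_OFFSET_BITS (64u - SLOT_TYPE_BITS - SLOT_COUNT_BITS)

#define SLOT_TYPE_MASK   ((UINT64_C(1) << SLOT_TYPE_BITS) - 1)
#define SLOT_COUNT_MASK  ((UINT64_C(1) << SLOT_COUNT_BITS) - 1)
#define SLOT_OFFSET_MASK ((UINT64_C(1) << SLOT_OFFSET_BITS) - 1)

/* Pack type, offset and count into a single 64-bit value. */
static inline uint64_t slot_pack(unsigned int type, uint64_t offset,
                                 unsigned int count)
{
    assert(type <= SLOT_TYPE_MASK);
    assert(offset <= SLOT_OFFSET_MASK);
    assert(count <= SLOT_COUNT_MASK);

    return ((uint64_t)type << (SLOT_COUNT_BITS + SLOT_OFFSET_BITS)) |
           ((uint64_t)count << SLOT_OFFSET_BITS) |
           offset;
}

static inline unsigned int slot_type(uint64_t slot)
{
    return (unsigned int)(slot >> (SLOT_COUNT_BITS + SLOT_OFFSET_BITS));
}

static inline unsigned int slot_count(uint64_t slot)
{
    return (unsigned int)((slot >> SLOT_OFFSET_BITS) & SLOT_COUNT_MASK);
}

static inline uint64_t slot_offset(uint64_t slot)
{
    return slot & SLOT_OFFSET_MASK;
}

/*
 * Bump the count in place. Returns 0 on success, or -1 if the field is
 * already saturated, i.e. the point where a real design would need
 * continuation or a wider count field.
 */
static inline int slot_count_inc(uint64_t *slot)
{
    if (slot_count(*slot) == SLOT_COUNT_MASK)
        return -1;
    *slot += UINT64_C(1) << SLOT_OFFSET_BITS;
    return 0;
}

/* Largest swapfile (in bytes) addressable with this offset width. */
static inline uint64_t slot_max_swapfile_bytes(unsigned int page_shift)
{
    return (UINT64_C(1) << SLOT_OFFSET_BITS) << page_shift;
}
```

With these made-up widths and 4 KiB pages (page_shift = 12), the offset
field covers 2^55 bytes per swapfile, and the count saturates at 65535.
Moving bits from one field to the other shifts that trade-off. None of
this addresses where the count lives once the entry is replaced by a
swapcache page, which is the second problem raised above.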