On Wed, Mar 1, 2023 at 4:30 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > Hi Yosry,
> >
> > On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > > Hello everyone,
> > >
> > > I would like to propose a topic for the upcoming LSF/MM/BPF in May 2023
> > > about swap & zswap (hope I am not too late).
> >
> > I am very interested in participating in this discussion as well.
>
> That's great to hear!
>
> > > ==================== Objective ====================
> > > Enabling the use of zswap without a backing swapfile, which makes zswap
> > > useful for a wider variety of use cases. Also, when zswap is used with
> > > a swapfile, the pages in zswap do not use up space in the swapfile, so
> > > the overall swapping capacity increases.
> >
> > Agree.
> >
> > > ==================== Idea ====================
> > > Introduce a data structure, which I currently call a swap_desc, as an
> > > abstraction layer between the swapping implementation and the rest of
> > > MM code. Page tables & page caches would store a swap id (encoded as a
> > > swp_entry_t) instead of directly storing the swap entry associated with
> > > the swapfile. This swap id maps to a struct swap_desc, which acts
> >
> > Can you provide a bit more detail? I am curious how the swap id maps to
> > the swap_desc. Is the swp_entry_t cast into "struct swap_desc*", or does
> > it go through some lookup table/tree?
>
> The swap id would be an index in a radix tree (aka xarray) that contains a
> pointer to the swap_desc struct. This lookup should be free with this
> design, as we also use the swap_desc to directly store the swap cache
> pointer, so this lookup essentially replaces the swap cache lookup.
>
> > > as our abstraction layer. All MM code not concerned with swapping
> > > details would operate in terms of swap descs. The swap_desc can point
> > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > entry. It can also include all non-backend-specific operations, such as
> > > the swapcache (which would be a simple pointer in swap_desc), swap
> >
> > Does the zswap entry still use the swap slot cache and swap_info_struct?
>
> In this design no, it shouldn't.
>
> > > This work enables using zswap without a backing swapfile and increases
> > > the swap capacity when zswap is used with a swapfile. It also creates a
> > > separation that allows us to skip code paths that don't make sense in
> > > the zswap path (e.g. readahead). We get to drop zswap's rbtree, which
> > > might result in better performance (fewer lookups, less lock
> > > contention).
> > >
> > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > removing swapper address spaces, removing swap count continuation code,
> > > etc). Another nice cleanup that this work enables would be separating
> > > the overloaded swp_entry_t into two distinct types: one for things that
> > > are stored in page tables / caches, and one for actual swap entries. In
> > > the future, we can potentially further optimize how we use the bits in
> > > the page tables instead of sticking everything into the current
> > > type/offset format.
> >
> > Looking forward to seeing more details in the upcoming discussion.
> >
> > > ==================== Cost ====================
> > > The obvious downside of this is added memory overhead, specifically for
> > > users that use swapfiles without zswap.
> > > Instead of paying one byte (swap_map) for every potential page in the
> > > swapfile (+ swap count continuation), we pay the size of the swap_desc
> > > for every page that is actually in the swapfile, which I am estimating
> > > can be roughly around 24 bytes or so, so maybe 0.6% of swapped out
> > > memory. The overhead only scales with pages actually swapped out. For
> > > zswap users, it should be
> >
> > Is there a way to avoid turning 1 byte into 24 bytes per swapped page?
> > For the users that use swap but no zswap, this is pure overhead.
>
> That is the best I could think of at this point. My idea was something
> like this:
>
> struct swap_desc {
>         union { /* Use one bit to distinguish them */
>                 swp_entry_t swap_entry;
>                 struct zswap_entry *zswap_entry;
>         };
>         struct folio *swapcache;
>         atomic_t swap_count;
>         u32 id;
> };
>
> Having the id in the swap_desc is convenient, as we can directly map the
> swap_desc to a swp_entry_t to place in the page tables, but I don't think
> it's necessary. Without it, the struct size is 20 bytes, and since the
> slab allocator only allocates in multiples of 8 bytes, I think we might as
> well use the extra 4 bytes anyway.
>
> The idea here is to unify the swapcache and swap_count implementation
> between different swap backends (swapfiles, zswap, etc), which would
> create a better abstraction and reduce reinventing the wheel.
>
> We could reduce it to only 8 bytes by storing only the swap/zswap entry,
> but we still need the swap cache anyway, so we might as well store the
> pointer in the struct and have a unified, lookup-free swapcache. So really
> 16 bytes is the minimum.
>
> If we stop at 16 bytes, then we need to handle the swap count separately
> in swapfiles and zswap. This is not the end of the world, but are the 8
> bytes worth this?
>
> Keep in mind that the current overhead is 1 byte per slot, i.e. O(max swap
> pages), not O(swapped pages). Also, 1 byte assumes we do not use the swap
> count continuation pages. If we do, it may end up being more. We also
> allocate continuations in full 4k pages, so even if one swap_map element
> in a page requires continuation, we allocate an entire page. What I am
> trying to say is that to get an actual comparison you need to also factor
> in the swap utilization and the rate of usage of swap continuation. I
> don't know how to come up with a formula for this tbh.
>
> Also, like Johannes said, the worst case overhead (32 bytes if you count
> the reverse mapping) is 0.8% of swapped memory, i.e. 8M for every 1G
> swapped. It doesn't sound *very* bad. I understand that it is pure
> overhead for people not using zswap, but it is not awful.

Oh, I forgot: I think the 24 bytes *might* actually be reduced to 16 bytes
if we free the underlying swap entry / zswap entry once we add the page to
the swapcache. I have not posted anything about it yet as I am still
thinking about whether there might be any synchronization problems with
this approach, but I will try it out.

> > It seems what you really need is one bit of information to indicate that
> > this page is backed by zswap. Then you can have a separate pointer for
> > the zswap entry.
>
> If you use one bit in swp_entry_t (or one of the available swap types) to
> indicate whether the page is backed by a swapfile or by zswap, it doesn't
> really work. We lose the indirection layer. How do we move the page from
> zswap to a swapfile? We would need to go update the page tables and the
> shmem page cache, similar to swapoff.
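To make the indirection a bit more concrete, here is a very rough sketch of
the swap_desc / xarray idea described above. All helper names below are
made up purely for illustration, and locking, refcounting and error
handling are omitted:

/*
 * Sketch only, not actual kernel code.
 */
#include <linux/xarray.h>
#include <linux/swap.h>

struct zswap_entry;                      /* compressed copy, owned by zswap */

struct swap_desc {
        union {                  /* one bit elsewhere says which member is live */
                swp_entry_t swap_entry;          /* slot in a swapfile */
                struct zswap_entry *zswap_entry; /* or a zswap object */
        };
        struct folio *swapcache; /* the swapcache becomes a plain pointer */
        atomic_t swap_count;
        u32 id;                  /* index of this desc in the xarray */
};

/* swap id -> swap_desc; the id is what page tables and page caches store */
static DEFINE_XARRAY_ALLOC(swap_descs);

static int swap_desc_create(struct swap_desc *desc)
{
        /* hands back a 32-bit id that can be encoded in a swp_entry_t */
        return xa_alloc(&swap_descs, &desc->id, desc, xa_limit_32b,
                        GFP_KERNEL);
}

static struct swap_desc *swap_desc_lookup(u32 id)
{
        return xa_load(&swap_descs, id);
}

/*
 * Moving a page from zswap to a swapfile (writeback) only rewrites the
 * desc; the id stored in the page tables and the shmem page cache stays
 * valid, so no swapoff-style walk is needed.
 */
static void swap_desc_writeback(struct swap_desc *desc, swp_entry_t slot)
{
        /* ...write the compressed page to 'slot', free the zswap_entry... */
        desc->swap_entry = slot;
}

The point being that the only stable token the rest of MM ever sees is the
id; everything backend-specific stays behind the desc.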
> Instead, if we store a key in swp_entry_t and use it to look up the
> swp_entry_t or zswap_entry pointer, then that's essentially what the
> swap_desc does. It just goes the extra mile of unifying the swapcache as
> well and storing it directly in the swap_desc instead of storing it in
> another lookup structure.
>
> > Depending on how much you are going to reuse the swap cache, you might
> > need to have something like a swap_info_struct to keep the locks happy.
>
> My current intention is to reimplement the swapcache completely as a
> pointer in struct swap_desc. This would eliminate that need, and a lot of
> the locking we do today, if I get things right.
>
> > > Another potential concern is readahead. With this design, we have no
> >
> > Readahead is for spinning disk :-) Even a normal swap file with an SSD
> > can use some modernization.
>
> Yeah, I initially thought we would only need the swp_entry_t -> swap_desc
> reverse mapping for readahead, and that we could store it only for
> spinning disks, but I was wrong. We need it for other things as well
> today: swapoff, and the case where we are trying to find an empty swap
> slot and start trying to free swap slots used only by the swapcache.
> However, I think both of these cases can be fixed (I can share more
> details if you want). If everything goes well, we should only need to
> maintain the reverse mapping (the extra overhead above 24 bytes) for swap
> files on spinning disks, for readahead.
>
> > Looking forward to your discussion.
> >
> > Chris
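One more illustration, since the swapcache keeps coming up: with the
swapcache stored as a pointer in swap_desc, the swapcache operations
collapse into plain field accesses on the desc. Again, this is only a
sketch with made-up helper names and no locking, not actual code:

/* Sketch only: swapcache access once it is just a field in swap_desc. */
static struct folio *swap_cache_get_folio_desc(u32 id)
{
        struct swap_desc *desc = swap_desc_lookup(id);

        /* today this is a lookup in one of the swapper address spaces */
        return desc ? desc->swapcache : NULL;
}

static void add_to_swap_cache_desc(struct swap_desc *desc,
                                   struct folio *folio)
{
        /* no xarray insertion, no swapper address_space involved */
        desc->swapcache = folio;
}

static void delete_from_swap_cache_desc(struct swap_desc *desc)
{
        desc->swapcache = NULL;
}

The idea is that this, plus whatever small per-desc synchronization turns
out to be needed, replaces the swapper address space machinery mentioned
above.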