On Wed, Mar 22, 2023 at 8:17 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > > Yosry Ahmed <yosryahmed@xxxxxxxxxx> writes: > > > On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> > >> Yosry Ahmed <yosryahmed@xxxxxxxxxx> writes: > >> > >> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> >> > >> >> Yosry Ahmed <yosryahmed@xxxxxxxxxx> writes: > >> >> > >> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> >> >> > >> >> >> Yosry Ahmed <yosryahmed@xxxxxxxxxx> writes: > >> >> >> > >> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> >> >> >> > >> >> >> >> Yosry Ahmed <yosryahmed@xxxxxxxxxx> writes: > >> >> >> >> > >> >> >> >> <snip> > >> >> >> >> > > >> >> >> >> > My current idea is to have one xarray that stores the swap_descs > >> >> >> >> > (which include swap_entry, swapcache, swap_count, etc), and only for > >> >> >> >> > rotating disks have an additional xarray that maps swap_entry -> > >> >> >> >> > swap_desc for cluster readahead, assuming we can eliminate all other > >> >> >> >> > situations requiring a reverse mapping. > >> >> >> >> > > >> >> >> >> > I am not sure how having separate xarrays help? If we have one xarray, > >> >> >> >> > might as well save the other lookups on put everything in swap_desc. > >> >> >> >> > In fact, this should improve the locking today as swapcache / > >> >> >> >> > swap_count operations can be lockless or very lightly contended. > >> >> >> >> > >> >> >> >> The condition of the proposal is "reverse mapping cannot be avoided for > >> >> >> >> enough situation". So, if reverse mapping (or cluster readahead) can be > >> >> >> >> avoided for enough situations, I think your proposal is good. Otherwise, > >> >> >> >> I propose to use 2 xarrays. You don't need another reverse mapping > >> >> >> >> xarray, because you just need to read the next several swap_entry into > >> >> >> >> the swap cache for cluster readahead. swap_desc isn't needed for > >> >> >> >> cluster readahead. > >> >> >> > > >> >> >> > swap_desc would be needed for cluster readahead in my original > >> >> >> > proposal as the swap cache lives in swap_descs. Based on the current > >> >> >> > implementation, we would need a reverse mapping (swap entry -> > >> >> >> > swap_desc) in 3 situations: > >> >> >> > > >> >> >> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and > >> >> >> > failing, we fallback to trying to find swap entries that only have a > >> >> >> > page in the swap cache (no references in page tables or page cache) > >> >> >> > and free them. This would require a reverse mapping. > >> >> >> > > >> >> >> > 2) swapoff: we need to swap in all entries in a swapfile, so we need > >> >> >> > to get all swap_descs associated with that swapfile. > >> >> >> > > >> >> >> > 3) swap cluster readahead. > >> >> >> > > >> >> >> > For (1), I think we can drop the dependency of a reverse mapping if we > >> >> >> > free swap entries once we swap a page in and add it to the swap cache, > >> >> >> > even if the swap count does not drop to 0. > >> >> >> > >> >> >> Now, we will not drop the swap cache even if the swap count becomes 0 if > >> >> >> swap space utility < 50%. Per my understanding, this avoid swap page > >> >> >> writing for read accesses. So I don't think we can change this directly > >> >> >> without necessary discussion firstly. > >> >> > > >> >> > > >> >> > Right. 
I am not sure I understand why we do this today, is it to save > >> >> > the overhead of allocating a new swap entry if the page is swapped out > >> >> > again soon? I am not sure I understand this statement "this avoid swap > >> >> > page > >> >> > writing for read accesses". > >> >> > > >> >> >> > >> >> >> > >> >> >> > For (2), instead of scanning page tables and shmem page cache to find > >> >> >> > swapped out pages for the swapfile, we can scan all swap_descs > >> >> >> > instead, we should be more efficient. This is one of the proposal's > >> >> >> > potential advantages. > >> >> >> > >> >> >> Good. > >> >> >> > >> >> >> > (3) is the one that would still need a reverse mapping with the > >> >> >> > current proposal. Today we use swap cluster readahead for anon pages > >> >> >> > if we have a spinning disk or vma readahead is disabled. For shmem, we > >> >> >> > always use cluster readahead. If we can limit cluster readahead to > >> >> >> > only rotating disks, then the reverse mapping can only be maintained > >> >> >> > for swapfiles on rotating disks. Otherwise, we will need to maintain a > >> >> >> > reverse mapping for all swapfiles. > >> >> >> > >> >> >> For shmem, I think that it should be good to readahead based on shmem > >> >> >> file offset instead of swap device offset. > >> >> >> > >> >> >> It's possible that some pages in the readahead window are from HDD while > >> >> >> some other pages aren't. So it's a little hard to enable cluster read > >> >> >> for HDD only. Anyway, it's not common to use HDD for swap now. > >> >> >> > >> >> >> >> > >> >> >> >> > If the point is to store the swap_desc directly inside the xarray to > >> >> >> >> > save 8 bytes, I am concerned that having multiple xarrays for > >> >> >> >> > swapcache, swap_count, etc will use more than that. > >> >> >> >> > >> >> >> >> The idea is to save the memory used by reverse mapping xarray. > >> >> >> > > >> >> >> > I see. > >> >> >> > > >> >> >> >> > >> >> >> >> >> >> > >> >> >> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not > >> >> >> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap > >> >> >> >> >> >> > continuation pages. If we do, it may end up being more. We also > >> >> >> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map > >> >> >> >> >> >> > element in a page requires continuation, we will allocate an entire > >> >> >> >> >> >> > page. What I am trying to say is that to get an actual comparison you > >> >> >> >> >> >> > need to also factor in the swap utilization and the rate of usage of > >> >> >> >> >> >> > swap continuation. I don't know how to come up with a formula for this > >> >> >> >> >> >> > tbh. > >> >> >> >> >> >> > > >> >> >> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you > >> >> >> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every > >> >> >> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure > >> >> >> >> >> >> > overhead for people not using zswap, but it is not very awful. > >> >> >> >> >> >> > > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> It seems what you really need is one bit of information to indicate > >> >> >> >> >> >> >> this page is backed by zswap. Then you can have a seperate pointer > >> >> >> >> >> >> >> for the zswap entry. 
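
(Side note, to make the swap_desc I keep referring to above more concrete:
below is a rough userspace-only model of what I have in mind. None of this
is kernel code; every name, field and size is made up for illustration and
nothing about the layout is settled.)

/*
 * Rough model of the proposed swap_desc. The swap cache and swap count
 * live in the desc, and the desc records which backend currently holds
 * the page.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static char fake_compressed_object;     /* stands in for a zswap object */

struct swap_desc {
        bool backed_by_zswap;           /* which backend holds the page now */
        union {
                uint64_t swap_entry;    /* slot in a swapfile */
                void *zswap_entry;      /* compressed object in zswap */
        };
        void *swapcache_folio;          /* swap cache lives in the desc */
        unsigned int swap_count;        /* refs from PTEs / shmem page cache */
};

/*
 * Writing a page back from zswap to a swapfile only rewrites the desc;
 * PTEs and the shmem page cache keep pointing at the same desc.
 */
static void writeback_to_swapfile(struct swap_desc *desc, uint64_t entry)
{
        desc->backed_by_zswap = false;
        desc->swap_entry = entry;
}

int main(void)
{
        struct swap_desc desc = {
                .backed_by_zswap = true,
                .zswap_entry = &fake_compressed_object,
                .swap_count = 1,
        };

        writeback_to_swapfile(&desc, 42);
        printf("backend=%s entry=%llu count=%u\n",
               desc.backed_by_zswap ? "zswap" : "swapfile",
               (unsigned long long)desc.swap_entry, desc.swap_count);
        return 0;
}

The only point of the model is that the swap cache and swap count would
live in the desc, so moving a page between zswap and a swapfile touches
nothing but the desc.
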
> >> >> >> >> >> >> > > >> >> >> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types) > >> >> >> >> >> >> > to indicate whether the page is backed with a swapfile or zswap it > >> >> >> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the > >> >> >> >> >> >> > page from zswap to swapfile? We need to go update the page tables and > >> >> >> >> >> >> > the shmem page cache, similar to swapoff. > >> >> >> >> >> >> > > >> >> >> >> >> >> > Instead, if we store a key else in swp_entry_t and use this to lookup > >> >> >> >> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what > >> >> >> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the > >> >> >> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of > >> >> >> >> >> >> > storing it in another lookup structure. > >> >> >> >> >> >> > >> >> >> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only > >> >> >> >> >> >> swap_entry in swap_desc. The added indirection appears to be another > >> >> >> >> >> >> level of page table with 1 entry. Then, we may use the similar method > >> >> >> >> >> >> as supporting system with 2 level and 3 level page tables, like the code > >> >> >> >> >> >> in include/asm-generic/pgtable-nopmd.h. But I haven't thought about > >> >> >> >> >> >> this deeply. > >> >> >> >> >> > > >> >> >> >> >> > Can you expand further on this idea? I am not sure I fully understand. > >> >> >> >> >> > >> >> >> >> >> OK. The goal is to avoid the overhead if indirection isn't enabled via > >> >> >> >> >> kconfig. > >> >> >> >> >> > >> >> >> >> >> If indirection isn't enabled, store swap_entry in PTE directly. > >> >> >> >> >> Otherwise, store index of swap_desc in PTE. Different functions (e.g., > >> >> >> >> >> to get/set swap_entry in PTE) are implemented based on kconfig. > >> >> >> >> > > >> >> >> >> > > >> >> >> >> > I thought about this, the problem is that we will have multiple > >> >> >> >> > implementations of multiple things. For example, swap_count without > >> >> >> >> > the indirection layer lives in the swap_map (with continuation logic). > >> >> >> >> > With the indirection layer, it lives in the swap_desc (or somewhere > >> >> >> >> > else). Same for the swapcache. Even if we keep the swapcache in an > >> >> >> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if > >> >> >> >> > the indirection is disabled, and by swap_desc (or similar) if the > >> >> >> >> > indirection is enabled. I think maintaining separate implementations > >> >> >> >> > for when the indirection is enabled/disabled would be adding too much > >> >> >> >> > complexity. > >> >> >> >> > > >> >> >> >> > WDYT? > >> >> >> >> > >> >> >> >> If we go this way, swap cache and swap_count will always be indexed by > >> >> >> >> swap_entry. swap_desc just provides a indirection to make it possible > >> >> >> >> to move between swap devices. > >> >> >> >> > >> >> >> >> Why must we index swap cache and swap_count by swap_desc if indirection > >> >> >> >> is enabled? Yes, we can save one xarray indexing if we do so, but I > >> >> >> >> don't think the overhead of one xarray indexing is a showstopper. > >> >> >> >> > >> >> >> >> I think this can be one intermediate step towards your final target. > >> >> >> >> The changes to current implementation can be smaller. 
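
(Another side note: the intermediate step suggested above, as I read it,
would look roughly like the tiny model below. The plain array only stands
in for an xarray keyed by swap id; all names are invented and this is not
kernel code.)

/*
 * Minimal indirection: swap cache and swap_map[] stay keyed by swap entry
 * exactly as today, and one extra map translates swap id -> swap entry
 * when indirection is enabled.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_SWAP_IDS 16

static uint64_t id_to_entry[NR_SWAP_IDS];   /* swap id -> current swap entry */

/* Every path that today starts from a swap entry pays one extra lookup. */
static uint64_t resolve_swap_entry(uint32_t swap_id)
{
        return id_to_entry[swap_id];
}

/*
 * Moving a swapped page to another device rewrites only this mapping;
 * PTEs and the shmem page cache keep holding the same swap id.
 */
static void move_swap_entry(uint32_t swap_id, uint64_t new_entry)
{
        id_to_entry[swap_id] = new_entry;
}

int main(void)
{
        uint32_t id = 3;

        id_to_entry[id] = 100;          /* slot on the fast device */
        move_swap_entry(id, 200);       /* demoted to the slow device */
        printf("swap id %u -> entry %llu\n", id,
               (unsigned long long)resolve_swap_entry(id));
        return 0;
}

The cost is the one extra xarray indexing mentioned above.
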
> >> >> >> > > >> >> >> > IIUC, the idea is to have two xarrays: > >> >> >> > (a) xarray that stores a pointer to a struct containing swap_count and > >> >> >> > swap cache. > >> >> >> > (b) xarray that stores the underlying swap entry or zswap entry. > >> >> >> > > >> >> >> > When indirection is disabled: > >> >> >> > page tables & page cache have swap entry directly like today, xarray > >> >> >> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse > >> >> >> > mapping needed. > >> >> >> > > >> >> >> > In this case we have an extra overhead of 12-16 bytes (the struct > >> >> >> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc. > >> >> >> > > >> >> >> > When indirection is enabled: > >> >> >> > page tables & page cache have a swap id (or swap_desc index), xarray > >> >> >> > (a) is indexed by swap id, > >> >> >> > >> >> >> xarray (a) is indexed by swap entry. > >> >> > > >> >> > > >> >> > How so? With the indirection enabled, the page tables & page cache > >> >> > have the swap id (or swap_desc index), which can point to a swap entry > >> >> > or a zswap entry -- which can change when the page is moved between > >> >> > zswap & swapfiles. How is xarray (a) indexed by the swap entry in this > >> >> > case? Shouldn't be indexed by the abstract swap id so that the > >> >> > writeback from zswap is transparent? > >> >> > >> >> In my mind, > >> >> > >> >> - swap core will define a abstract interface to swap implementations > >> >> (zswap, swap device/file, maybe more in the future), like VFS. > >> >> > >> >> - zswap will be a special swap implementation (compressing instead of > >> >> writing to disk). > >> >> > >> >> - swap core will manage the indirection layer and swap cache. > >> >> > >> >> - swap core can move swap pages between swap implementations (e.g., from > >> >> zswap to a swap device, or from one swap device to another swap > >> >> device) with the help of the indirection layer. > >> >> > >> >> In this design, the writeback from zswap becomes moving swapped pages > >> >> from zswap to a swap device. > >> > > >> > > >> > All the above matches my understanding of this proposal. swap_desc is > >> > the proposed indirection layer, and the swap implementations are zswap > >> > & swap devices. For now, we only have 2 static swap implementations > >> > (zswap->swapfile). In the future, we can make this more dynamic as the > >> > need arises. > >> > >> Great to align with you on this. > > > > Great! > > > >> > >> >> > >> >> If my understanding were correct, your suggestion is kind of moving > >> >> zswap logic to the swap core? And zswap will be always at a higher > >> >> layer on top of swap device/file? > >> > > >> > > >> > We do not want to move the zswap logic into the swap core, we want to > >> > make the swap core independent of the swap implementation, and zswap > >> > is just one possible implementation. > >> > >> Good! > >> > >> I found that you put zswap related data structure inside struct > >> swap_desc directly. I think that we should avoid that as much as > >> possible. > > > > That was just an initial draft to show what it would be like, I do not > > intend to hardcode zswap-specific items in swap_desc. > > > >> > >> >> > >> >> >> > >> >> >> > >> >> >> > xarray (b) is indexed by swap id as well > >> >> >> > and contain swap entry or zswap entry. Reverse mapping might be > >> >> >> > needed. > >> >> >> > >> >> >> Reverse mapping isn't needed. > >> >> > > >> >> > > >> >> > It would be needed if xarray (a) is indexed by the swap id. 
I am not > >> >> > sure I understand how it can be indexed by the swap entry if the > >> >> > indirection is enabled. > >> >> > > >> >> >> > >> >> >> > >> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for > >> >> >> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping > >> >> >> > where needed. > >> >> >> > > >> >> >> > There is also the extra cpu overhead for an extra lookup in certain paths. > >> >> >> > > >> >> >> > Is my analysis correct? If yes, I agree that the original proposal is > >> >> >> > good if the reverse mapping can be avoided in enough situations, and > >> >> >> > that we should consider such alternatives otherwise. As I mentioned > >> >> >> > above, I think it comes down to whether we can completely restrict > >> >> >> > cluster readahead to rotating disks or not -- in which case we need to > >> >> >> > decide what to do for shmem and for anon when vma readahead is > >> >> >> > disabled. > >> >> >> > >> >> >> We can even have a minimal indirection implementation. Where, swap > >> >> >> cache and swap_map[] are kept as they ware before, just one xarray is > >> >> >> added. The xarray is indexed by swap id (or swap_desc index) to store > >> >> >> the corresponding swap entry. > >> >> >> > >> >> >> When indirection is disabled, no extra overhead. > >> >> >> > >> >> >> When indirection is enabled, the extra overhead is just 8 bytes per > >> >> >> swapped page. > >> >> >> > >> >> >> The basic migration support can be build on top of this. > >> >> >> > >> >> >> I think that this could be a baseline for indirection support. Then > >> >> >> further optimization can be built on top of it step by step with > >> >> >> supporting data. > >> >> > > >> >> > > >> >> > I am not sure how this works with zswap. Currently swap_map[] > >> >> > implementation is specific for swapfiles, it does not work for zswap > >> >> > unless we implement separate swap counting logic for zswap & > >> >> > swapfiles. Same for the swapcache, it currently supports being indexed > >> >> > by a swap entry, it would need to support being indexed by a swap id, > >> >> > or have a separate swap cache for zswap. Having separate > >> >> > implementation would add complexity, and we would need to perform > >> >> > handoffs of the swap count/cache when a page is moved from zswap to a > >> >> > swapfile. > >> >> > >> >> We can allocate a swap entry for each swapped page in zswap. > >> > > >> > > >> > This is exactly what the current implementation does and what we want > >> > to move away from. The current implementation uses zswap as an > >> > in-memory compressed cache on top of an actual swap device, and each > >> > swapped page in zswap has a swap entry allocated. With this > >> > implementation, zswap cannot be used without a swap device. > >> > >> I totally agree that we should avoid to use an actual swap device under > >> zswap. And, as an swap implementation, zswap can manage the swap entry > >> inside zswap without an underlying actual swap device. For example, > >> when we swap a page to zswap (actually compress), we can allocate a > >> (virtual) swap entry in the zswap. I understand that there's overhead > >> to manage the swap entry in zswap. We can consider how to reduce the > >> overhead. > > > > I see. So we can (for example) use one of the swap types for zswap, > > and then have zswap code handle this entry according to its > > implementation. 
We can then have an xarray that maps swap ID -> swap > > entry, and this swap entry is used to index the swap cache and such. > > When a swapped page is moved between backends we update the swap ID -> > > swap entry xarray. > > > > This is one possible implementation that I thought of (very briefly > > tbh), but it does have its problems: > > For zswap: > > - Managing swap entries inside zswap unnecessarily. > > - We need to maintain a swap entry -> zswap entry mapping in zswap -- > > similar to the current rbtree, which is something that we can get rid > > of with the initial proposal if we embed the zswap_entry pointer > > directly in the swap_desc (it can be encoded to avoid breaking the > > abstraction). > > > > For mm/swap in general: > > - When we allocate a swap entry today, we store it in folio->private > > (or page->private), which is used by the unmapping code to be placed > > in the page tables or shmem page cache. With this implementation, we > > need to store the swap ID in page->private instead, which means that > > every time we need to access the swap cache during reclaim/swapout we > > need to lookup the swap entry first. > > - On the fault path, we need two lookups instead of one (swap ID -> > > swap entry, swap entry -> swap cache), not sure how this affects fault > > latency. > > - Each swap backend will have its own separate implementation of swap > > counting, which is hard to maintain and very error-prone since the > > logic is backend-agnostic. > > - Handing over a page from one swap backend to another includes > > handing over swap cache entries and swap counts, which I imagine will > > involve considerable synchronization. > > > > Do you have any thoughts on this? > > Yes. I understand there's additional overhead. I have no clear idea > about how to reduce this now. We need to think about that in depth. > > The bottom line is whether this is worse than the current zswap > implementation? It's not just zswap, as I note above, this design would introduce some overheads to the core swapping code as well as long as the indirection layer is active. I am particularly worried about the extra lookups on the fault path. For zswap, we already have a lookup today, so maintaining swap entry -> zswap entry mapping would not be a regression, but I am not sure about the extra overhead to manage swap entries within zswap. Keep in mind that using swap entries for zswap probably implies having a fixed/max size for zswap (to be able to manage the swap entries efficiently similar to swap devices), which is a limitation that the initial proposal was hoping to overcome. > > Best Regards, > Huang, Ying > > >> > >> >> > >> >> > >> >> >> > >> >> >> > >> >> >> >> > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might > >> >> >> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy. > >> >> >> >> >> >> > > >> >> >> >> >> >> > My current intention is to reimplement the swapcache completely as a > >> >> >> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot > >> >> >> >> >> >> > of the locking we do today if I get things right. > >> >> >> >> >> >> > > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> > Another potential concern is readahead. With this design, we have no > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can > >> >> >> >> >> >> >> use some modernization. 
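
(Since readahead comes up again right below: this is roughly what I mean
by keeping the swap_entry -> swap_desc reverse mapping only for swap
devices on rotating disks. Again just a sketch with made-up names, not
kernel code.)

/*
 * Only rotational devices populate the reverse map (its only remaining
 * user would be cluster readahead); SSD swapfiles and zswap skip it and
 * do not pay the extra pointer per swapped page.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_SLOTS 64

struct swap_desc { int dummy; };        /* placeholder for the real desc */

struct swap_device {
        bool rotational;                            /* spinning disk? */
        struct swap_desc *reverse_map[NR_SLOTS];    /* offset -> desc */
};

static void note_swapout(struct swap_device *dev, uint64_t offset,
                         struct swap_desc *desc)
{
        if (dev->rotational)
                dev->reverse_map[offset] = desc;
}

int main(void)
{
        struct swap_device ssd = { .rotational = false };
        struct swap_device hdd = { .rotational = true };
        struct swap_desc desc;

        note_swapout(&ssd, 5, &desc);
        note_swapout(&hdd, 5, &desc);
        printf("ssd slot 5: %p, hdd slot 5: %p\n",
               (void *)ssd.reverse_map[5], (void *)hdd.reverse_map[5]);
        return 0;
}
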
> >> >> >> >> >> >> > > >> >> >> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t -> > >> >> >> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store > >> >> >> >> >> >> > that for spinning disks, but I was wrong. We need for other things as > >> >> >> >> >> >> > well today: swapoff, when trying to find an empty swap slot and we > >> >> >> >> >> >> > start trying to free swap slots used only by the swapcache. However, I > >> >> >> >> >> >> > think both of these cases can be fixed (I can share more details if > >> >> >> >> >> >> > you want). If everything goes well we should only need to maintain the > >> >> >> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on > >> >> >> >> >> >> > spinning disks for readahead. > >> >> >> >> >> >> > > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Looking forward to your discussion. > >> >> >> >> >> > >> >> >> >> >> Per my understanding, the indirection is to make it easy to move > >> >> >> >> >> (swapped) pages among swap devices based on hot/cold. This is similar > >> >> >> >> >> as the target of memory tiering. It appears that we can extend the > >> >> >> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too? > >> >> >> >> >> Is it possible for zswap to be faster than some slow memory media? > >> >> >> >> > > >> >> >> >> > > >> >> >> >> > Agree with Chris that this may require a much larger overhaul. A slow > >> >> >> >> > memory tier is still addressable memory, swap/zswap requires a page > >> >> >> >> > fault to read the pages. I think (at least for now) there is a > >> >> >> >> > fundamental difference. We want reclaim to eventually treat slow > >> >> >> >> > memory & swap as just different tiers to place cold memory in with > >> >> >> >> > different characteristics, but otherwise I think the swapping > >> >> >> >> > implementation itself is very different. Am I missing something? > >> >> >> >> > >> >> >> >> Is it possible that zswap is faster than a really slow memory > >> >> >> >> addressable device backed by NAND? TBH, I don't have the answer. > >> >> >> > > >> >> >> > I am not sure either. > >> >> >> > > >> >> >> >> > >> >> >> >> Anyway, do you need a way to describe the tiers of the swap devices? > >> >> >> >> So, you can move the cold pages among the swap devices based on that? > >> >> >> > > >> >> >> > For now I think the "tiers" in this proposal are just zswap and normal > >> >> >> > swapfiles. We can later extend it to support more explicit tiering. > >> >> >> > >> >> >> IIUC, in original zswap implementation, there's 1:1 relationship between > >> >> >> zswap and normal swapfile. But now, you make demoting among swap > >> >> >> devices more general. Then we need some general way to specify which > >> >> >> swap devices are fast and which are slow, and the demoting relationship > >> >> >> among them. It can be memory tiers or something else, but we need one. > >> >> > > >> >> > > >> >> > I think for this proposal, there are only 2 hardcoded tiers. Zswap is > >> >> > fast, swapfile is slow. In the future, we can support more dynamic > >> >> > tiering if the need arises. > >> >> > >> >> We can start from a simple implementation. And I think that it's better > >> >> to consider the general design too. Try not to make it impossible now. > >> > > >> > > >> > Right. I am proposing we come up with an abstract generic interface > >> > for swap implementations, and have 2 implementations statically > >> > defined (swapfiles and zswap). 
If the need arises, we can make swap > >> > implementations more dynamic in the future. > >> > > >> >> > >> >> > >> >> Best Regards, > >> >> Huang, Ying > >> >
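
(Appending one last sketch down here rather than breaking up the quoted
context: this is roughly what I mean above by an abstract generic
interface for swap implementations, with zswap and swapfiles as the two
statically defined backends for now. All names are invented; it is only
meant to show the shape, not a real API.)

/*
 * The swap core talks to backends through an ops table and never looks
 * inside them; moving a page between backends is a store into the new
 * backend plus an update of the indirection layer.
 */
#include <stdio.h>

struct swap_backend_ops {
        const char *name;
        int (*store)(void *folio);      /* swap a page out to this backend */
        int (*load)(void *folio);       /* swap a page back in */
};

static int zswap_store(void *folio)    { printf("compress %p\n", folio); return 0; }
static int zswap_load(void *folio)     { printf("decompress %p\n", folio); return 0; }
static int swapfile_store(void *folio) { printf("write %p\n", folio); return 0; }
static int swapfile_load(void *folio)  { printf("read %p\n", folio); return 0; }

/* Two statically defined implementations for now; could grow later. */
static const struct swap_backend_ops backends[] = {
        { .name = "zswap",    .store = zswap_store,    .load = zswap_load },
        { .name = "swapfile", .store = swapfile_store, .load = swapfile_load },
};

int main(void)
{
        int dummy_folio;

        backends[0].store(&dummy_folio);        /* swap out to zswap */
        backends[1].store(&dummy_folio);        /* later: writeback to swapfile */
        backends[1].load(&dummy_folio);         /* fault it back in */
        return 0;
}
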