On Thu, Sep 5, 2024 at 7:55 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Thu, Sep 5, 2024 at 12:03 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >
> > On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >
> > > [..]
> > > > > I understand the point of doing this to unblock the synchronous large
> > > > > folio swapin support work, but at some point we're gonna have to
> > > > > actually handle the cases where a large folio being swapped in is
> > > > > partially in the swap cache, zswap, the zeromap, etc.
> > > > >
> > > > > All these cases will need similar-ish handling, and I suspect we won't
> > > > > just skip swapping in large folios in all these cases.
> > > >
> > > > I agree that this is definitely the goal. `swap_read_folio()` should be a
> > > > dependable API that always returns reliable data, regardless of whether
> > > > `zeromap` or `zswap` is involved. Despite these issues, mTHP swap-in shouldn't
> > > > be held back. Significant efforts are underway to support large folios in
> > > > `zswap`, and progress is being made. Not to mention we've already allowed
> > > > `zeromap` to proceed, even though it doesn't support large folios.
> > > >
> > > > It's genuinely unfair to let the lack of mTHP support in `zeromap` and
> > > > `zswap` hold swap-in hostage.
> > >
> >
> > Hi Yosry,
> >
> > > Well, two points here:
> > >
> > > 1. I did not say that we should block the synchronous mTHP swapin work
> > > for this :) I said the next item on the TODO list for mTHP swapin
> > > support should be handling these cases.
> >
> > Thanks for your clarification!
> >
> > >
> > > 2. I think two things are getting conflated here. Zswap needs to
> > > support mTHP swapin*. Zeromap already supports mTHPs AFAICT. What is
> > > truly, and is outside the scope of zswap/zeromap, is being able to
> > > support hybrid mTHP swapin.
> > >
> > > When swapping in an mTHP, the swapped entries can be on disk, in the
> > > swapcache, in zswap, or in the zeromap. Even if all these things
> > > support mTHPs individually, we essentially need support to form an
> > > mTHP from swap entries in different backends. That's what I meant.
> > > Actually if we have that, we may not really need mTHP swapin support
> > > in zswap, because we can just form the large folio in the swap layer
> > > from multiple zswap entries.
> > >
> >
> > After further consideration, I've actually started to disagree with the idea
> > of supporting hybrid swapin (forming an mTHP from swap entries in different
> > backends). My reasoning is as follows:
>
> I do not have any data about this, so you could very well be right
> here. Handling hybrid swapin could be simply falling back to the
> smallest order we can swapin from a single backend. We can at least
> start with this, and collect data about how many mTHP swapins fallback
> due to hybrid backends. This way we only take the complexity if
> needed.
>
> I did imagine though that it's possible for two virtually contiguous
> folios to be swapped out to contiguous swap entries and end up in
> different media (e.g. if only one of them is zero-filled). I am not
> sure how rare it would be in practice.
>
> >
> > 1. The scenario where an mTHP is partially zeromap, partially zswap, etc.,
> > would be an extremely rare case, as long as we're swapping out the mTHP as
> > a whole and all the modules are handling it accordingly. It's highly
> > unlikely to form this mix of zeromap, zswap, and swapcache unless the
> > contiguous VMA virtual address happens to get some small folios with
> > aligned and contiguous swap slots. Even then, they would need to be
> > partially zeromap and partially non-zeromap, zswap, etc.
>
> As I mentioned, we can start simple and collect data for this. If it's
> rare and we don't need to handle it, that's good.
>
> >
> > As you mentioned, zeromap handles mTHP as a whole during swapping
> > out, marking all subpages of the entire mTHP as zeromap rather than just
> > a subset of them.
> >
> > And swap-in can also entirely map a swapcache which is a large folio based
> > on our previous patchset which has been in mainline:
> > "mm: swap: entirely map large folios found in swapcache"
> > https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@xxxxxxxxx/
> >
> > It seems the only thing we're missing is zswap support for mTHP.
>
> It is still possible for two virtually contiguous folios to be swapped
> out to contiguous swap entries. It is also possible that a large folio
> is swapped out as a whole, then only a part of it is swapped in later
> due to memory pressure. If that part is later reclaimed again and gets
> added to the swapcache, we can run into the hybrid swapin situation.
> There may be other scenarios as well, I did not think this through.
>
> >
> > 2. Implementing hybrid swap-in would be extremely tricky and could disrupt
> > several software layers. I can share some pseudo code below:
>
> Yeah it definitely would be complex, so we need proper justification for it.
>
> >
> > swap_read_folio()
> > {
> >         if (zeromap_full)
> >                 folio_read_from_zeromap()
> >         else if (zswap_map_full)
> >                 folio_read_from_zswap()
> >         else {
> >                 folio_read_from_swapfile()
> >                 if (zeromap_partial)
> >                         folio_read_from_zeromap_fixup() /* fill zero
> >                                 for partially zeromap subpages */
> >                 if (zswap_partial)
> >                         folio_read_from_zswap_fixup() /* zswap_load
> >                                 for partially zswap-mapped subpages */
> >
> >                 folio_mark_uptodate()
> >                 folio_unlock()
> >         }
> > }
> >
> > We'd also need to modify folio_read_from_swapfile() to skip
> > folio_mark_uptodate() and folio_unlock() after completing the BIO.
> > This approach seems to entirely disrupt the software layers.
> >
> > This could also lead to unnecessary IO operations for subpages that
> > require fixup.
> > Since such cases are quite rare, I believe the added complexity isn't worth it.
> >
> > My point is that we should simply check that all PTEs have consistent zeromap,
> > zswap, and swapcache statuses before proceeding, otherwise fall back to the next
> > lower order if needed. This approach improves performance and avoids complex
> > corner cases.
>
> Agree that we should start with that, although we should probably
> fallback to the largest order we can swapin from a single backend,
> rather than the next lower order.
>
> >
> > So once zswap mTHP is there, I would also expect an API similar to
> > swap_zeromap_entries_check()
> > for example:
> > zswap_entries_check(entry, nr) which can return if we are having
> > full, non, and partial zswap to replace the existing
> > zswap_never_enabled().
>
> I think a better API would be similar to what Usama had. Basically
> take in (entry, nr) and return how much of it is in zswap starting at
> entry, so that we can decide the swapin order.
>
> Maybe we can adjust your proposed swap_zeromap_entries_check() as well
> to do that? Basically return the number of swap entries in the zeromap
> starting at 'entry'. If 'entry' itself is not in the zeromap we return
> 0 naturally. That would be a small adjustment/fix over what Usama had,
> but implementing it with bitmap operations like you did would be
> better.

I assume you mean the below:

/*
 * Return the number of contiguous zeromap entries starting from entry
 */
static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry, int nr)
{
	struct swap_info_struct *sis = swp_swap_info(entry);
	unsigned long start = swp_offset(entry);
	unsigned long end = start + nr;
	unsigned long idx;

	/* The first set bit at or after 'start' must be 'start' itself. */
	idx = find_next_bit(sis->zeromap, end, start);
	if (idx != start)
		return 0;

	/* Length of the run of set bits, capped at 'end'. */
	return find_next_zero_bit(sis->zeromap, end, start) - idx;
}

If yes, I really like this idea.
It seems much better than using an enum, which would require adding a new
data structure :-) Additionally, returning the number allows callers to fall
back to the largest possible order, rather than trying next lower orders
sequentially.

Hi Usama, what is your take on this?

>
> >
> > Though I am not sure how cheap zswap can implement it,
> > swap_zeromap_entries_check()
> > could be two simple bit operations:
> >
> > +static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t entry, int nr)
> > +{
> > +	struct swap_info_struct *sis = swp_swap_info(entry);
> > +	unsigned long start = swp_offset(entry);
> > +	unsigned long end = start + nr;
> > +
> > +	if (find_next_bit(sis->zeromap, end, start) == end)
> > +		return SWAP_ZEROMAP_NON;
> > +	if (find_next_zero_bit(sis->zeromap, end, start) == end)
> > +		return SWAP_ZEROMAP_FULL;
> > +
> > +	return SWAP_ZEROMAP_PARTIAL;
> > +}
> >
> > 3. swapcache is different from zeromap and zswap. Swapcache indicates
> > that the memory is still available and should be re-mapped rather than
> > allocating a new folio. Our previous patchset has implemented a full
> > re-map of an mTHP in do_swap_page() as mentioned in 1.
> >
> > For the same reason as point 1, partial swapcache is a rare edge case.
> > Not re-mapping it and instead allocating a new folio would add
> > significant complexity.
> >
> > > >
> > > > Nonetheless, `zeromap` and `zswap` are distinct cases. With `zeromap`, we
> > > > permit almost all mTHP swap-ins, except for those rare situations where
> > > > small folios that were swapped out happen to have contiguous and aligned
> > > > swap slots.
> > > >
> > > > swapcache is another quite different story, since our user scenarios begin from
> > > > the simplest sync io on mobile phones, we don't quite care about swapcache.
> > >
> > > Right.
> > > The reason I bring this up is as I mentioned above, there is a
> > > common problem of forming large folios from different sources, which
> > > includes the swap cache. The fact that synchronous swapin does not use
> > > the swapcache was a happy coincidence for you, as you can add support
> > > mTHP swapins without handling this case yet ;)
> >
> > As I mentioned above, I'd really rather filter out those corner cases
> > than support them, not just for the current situation to unlock swap-in
> > series :-)
>
> If they are indeed corner cases, then I definitely agree.

Thanks
Barry
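
P.S. To make the count-returning idea a bit more concrete, here is a rough
userspace sketch, not kernel code and not tested against the kernel. All
model_* names are made up for illustration. A plain loop stands in for
find_next_bit()/find_next_zero_bit(), and the caller that turns the count
into an order is only my assumption about how such a helper might be used.
It covers just the case where the candidate range starts at the first slot.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Userspace stand-in for test_bit(); hypothetical, not the kernel helper. */
static int model_test_bit(const unsigned long *map, unsigned long bit)
{
	return (map[bit / MODEL_BITS_PER_LONG] >> (bit % MODEL_BITS_PER_LONG)) & 1UL;
}

/* Hypothetical helper: mark swap slot 'bit' as zero-filled in the model. */
static void model_set_bit(unsigned long *map, unsigned long bit)
{
	map[bit / MODEL_BITS_PER_LONG] |= 1UL << (bit % MODEL_BITS_PER_LONG);
}

/*
 * Model of the count-style helper: number of contiguous zeromap slots
 * starting at 'start', capped at 'nr'. Returns 0 when 'start' itself is
 * not in the zeromap.
 */
static unsigned int model_zeromap_count(const unsigned long *map,
					unsigned long start, unsigned int nr)
{
	unsigned int n = 0;

	while (n < nr && model_test_bit(map, start + n))
		n++;
	return n;
}

/*
 * Assumed caller: largest order (<= max_order) such that the first
 * (1 << order) slots from 'start' are all zeromap. One call to the count
 * helper is enough; returns -1 when 'start' is not zero-filled.
 */
static int model_largest_zeromap_order(const unsigned long *map,
				       unsigned long start,
				       unsigned int max_order)
{
	unsigned int count;
	int order = -1;

	assert(max_order < 31);	/* keep the shifts below well defined */
	count = model_zeromap_count(map, start, 1U << max_order);

	/* Grow the order while the next power of two still fits in the run. */
	while (order < (int)max_order && (1U << (order + 1)) <= count)
		order++;
	return order;
}
```

With a three-state full/non/partial result, a caller that sees "partial"
only knows the current order does not work and has to retry at the next
lower order. With a count, the order falls out of a single query.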