On Thu, Sep 5, 2024 at 10:37 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 05/09/2024 11:10, Barry Song wrote:
> > On Thu, Sep 5, 2024 at 8:49 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>
> >> On Thu, Sep 5, 2024 at 7:55 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >>>
> >>> On Thu, Sep 5, 2024 at 12:03 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>>>
> >>>> On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> [..]
> >>>>>>> I understand the point of doing this to unblock the synchronous large
> >>>>>>> folio swapin support work, but at some point we're gonna have to
> >>>>>>> actually handle the cases where a large folio being swapped in is
> >>>>>>> partially in the swap cache, zswap, the zeromap, etc.
> >>>>>>>
> >>>>>>> All these cases will need similar-ish handling, and I suspect we won't
> >>>>>>> just skip swapping in large folios in all these cases.
> >>>>>>
> >>>>>> I agree that this is definitely the goal. `swap_read_folio()` should be a
> >>>>>> dependable API that always returns reliable data, regardless of whether
> >>>>>> `zeromap` or `zswap` is involved. Despite these issues, mTHP swap-in shouldn't
> >>>>>> be held back. Significant efforts are underway to support large folios in
> >>>>>> `zswap`, and progress is being made. Not to mention we've already allowed
> >>>>>> `zeromap` to proceed, even though it doesn't support large folios.
> >>>>>>
> >>>>>> It's genuinely unfair to let the lack of mTHP support in `zeromap` and
> >>>>>> `zswap` hold swap-in hostage.
> >>>>>
> >>>>
> >>>> Hi Yosry,
> >>>>
> >>>>> Well, two points here:
> >>>>>
> >>>>> 1. I did not say that we should block the synchronous mTHP swapin work
> >>>>> for this :) I said the next item on the TODO list for mTHP swapin
> >>>>> support should be handling these cases.
> >>>>
> >>>> Thanks for your clarification!
> >>>>
> >>>>>
> >>>>> 2. I think two things are getting conflated here. Zswap needs to
> >>>>> support mTHP swapin*.
> >>>>> Zeromap already supports mTHPs AFAICT. What is
> >>>>> truly missing, and is outside the scope of zswap/zeromap, is being able to
> >>>>> support hybrid mTHP swapin.
> >>>>>
> >>>>> When swapping in an mTHP, the swapped entries can be on disk, in the
> >>>>> swapcache, in zswap, or in the zeromap. Even if all these things
> >>>>> support mTHPs individually, we essentially need support to form an
> >>>>> mTHP from swap entries in different backends. That's what I meant.
> >>>>> Actually if we have that, we may not really need mTHP swapin support
> >>>>> in zswap, because we can just form the large folio in the swap layer
> >>>>> from multiple zswap entries.
> >>>>>
> >>>>
> >>>> After further consideration, I've actually started to disagree with the idea
> >>>> of supporting hybrid swapin (forming an mTHP from swap entries in different
> >>>> backends). My reasoning is as follows:
> >>>
> >>> I do not have any data about this, so you could very well be right
> >>> here. Handling hybrid swapin could be simply falling back to the
> >>> smallest order we can swapin from a single backend. We can at least
> >>> start with this, and collect data about how many mTHP swapins fallback
> >>> due to hybrid backends. This way we only take the complexity if
> >>> needed.
> >>>
> >>> I did imagine though that it's possible for two virtually contiguous
> >>> folios to be swapped out to contiguous swap entries and end up in
> >>> different media (e.g. if only one of them is zero-filled). I am not
> >>> sure how rare it would be in practice.
> >>>
> >>>>
> >>>> 1. The scenario where an mTHP is partially zeromap, partially zswap, etc.,
> >>>> would be an extremely rare case, as long as we're swapping out the mTHP as
> >>>> a whole and all the modules are handling it accordingly. It's highly
> >>>> unlikely to form this mix of zeromap, zswap, and swapcache unless the
> >>>> contiguous VMA virtual address happens to get some small folios with
> >>>> aligned and contiguous swap slots.
> >>>> Even then, they would need to be
> >>>> partially zeromap and partially non-zeromap, zswap, etc.
> >>>
> >>> As I mentioned, we can start simple and collect data for this. If it's
> >>> rare and we don't need to handle it, that's good.
> >>>
> >>>>
> >>>> As you mentioned, zeromap handles mTHP as a whole during swapping
> >>>> out, marking all subpages of the entire mTHP as zeromap rather than just
> >>>> a subset of them.
> >>>>
> >>>> And swap-in can also entirely map a swapcache which is a large folio based
> >>>> on our previous patchset which has been in mainline:
> >>>> "mm: swap: entirely map large folios found in swapcache"
> >>>> https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@xxxxxxxxx/
> >>>>
> >>>> It seems the only thing we're missing is zswap support for mTHP.
> >>>
> >>> It is still possible for two virtually contiguous folios to be swapped
> >>> out to contiguous swap entries. It is also possible that a large folio
> >>> is swapped out as a whole, then only a part of it is swapped in later
> >>> due to memory pressure. If that part is later reclaimed again and gets
> >>> added to the swapcache, we can run into the hybrid swapin situation.
> >>> There may be other scenarios as well, I did not think this through.
> >>>
> >>>>
> >>>> 2. Implementing hybrid swap-in would be extremely tricky and could disrupt
> >>>> several software layers. I can share some pseudo code below:
> >>>
> >>> Yeah it definitely would be complex, so we need proper justification for it.
> >>>
> >>>>
> >>>> swap_read_folio()
> >>>> {
> >>>>         if (zeromap_full)
> >>>>                 folio_read_from_zeromap()
> >>>>         else if (zswap_map_full)
> >>>>                 folio_read_from_zswap()
> >>>>         else {
> >>>>                 folio_read_from_swapfile()
> >>>>                 if (zeromap_partial)
> >>>>                         folio_read_from_zeromap_fixup() /* fill zero for partially zeromap subpages */
> >>>>                 if (zswap_partial)
> >>>>                         folio_read_from_zswap_fixup() /* zswap_load for partially zswap-mapped subpages */
> >>>>
> >>>>                 folio_mark_uptodate()
> >>>>                 folio_unlock()
> >>>>         }
> >>>> }
> >>>>
> >>>> We'd also need to modify folio_read_from_swapfile() to skip
> >>>> folio_mark_uptodate() and folio_unlock() after completing the BIO.
> >>>> This approach seems to entirely disrupt the software layers.
> >>>>
> >>>> This could also lead to unnecessary IO operations for subpages that
> >>>> require fixup. Since such cases are quite rare, I believe the added
> >>>> complexity isn't worth it.
> >>>>
> >>>> My point is that we should simply check that all PTEs have consistent zeromap,
> >>>> zswap, and swapcache statuses before proceeding, otherwise fall back to the next
> >>>> lower order if needed. This approach improves performance and avoids complex
> >>>> corner cases.
> >>>
> >>> Agree that we should start with that, although we should probably
> >>> fallback to the largest order we can swapin from a single backend,
> >>> rather than the next lower order.
> >>>
> >>>>
> >>>> So once zswap mTHP is there, I would also expect an API similar to
> >>>> swap_zeromap_entries_check(), for example:
> >>>> zswap_entries_check(entry, nr) which can return if we are having
> >>>> full, non, and partial zswap to replace the existing
> >>>> zswap_never_enabled().
> >>>
> >>> I think a better API would be similar to what Usama had. Basically
> >>> take in (entry, nr) and return how much of it is in zswap starting at
> >>> entry, so that we can decide the swapin order.
> >>>
> >>> Maybe we can adjust your proposed swap_zeromap_entries_check() as well
> >>> to do that? Basically return the number of swap entries in the zeromap
> >>> starting at 'entry'. If 'entry' itself is not in the zeromap we return
> >>> 0 naturally. That would be a small adjustment/fix over what Usama had,
> >>> but implementing it with bitmap operations like you did would be
> >>> better.
> >>
> >> I assume you mean the below
> >>
> >> /*
> >>  * Return the number of contiguous zeromap entries started from entry
> >>  */
> >> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry, int nr)
> >> {
> >>         struct swap_info_struct *sis = swp_swap_info(entry);
> >>         unsigned long start = swp_offset(entry);
> >>         unsigned long end = start + nr;
> >>         unsigned long idx;
> >>
> >>         idx = find_next_bit(sis->zeromap, end, start);
> >>         if (idx != start)
> >>                 return 0;
> >>
> >>         return find_next_zero_bit(sis->zeromap, end, start) - idx;
> >> }
> >>
> >> If yes, I really like this idea.
> >>
> >> It seems much better than using an enum, which would require adding a new
> >> data structure :-) Additionally, returning the number allows callers
> >> to fall back to the largest possible order, rather than trying next
> >> lower orders sequentially.
> >
> > No, returning 0 after only checking first entry would still reintroduce
> > the current bug, where the start entry is zeromap but other entries
> > might not be.
> > We need another value to indicate whether the entries
> > are consistent if we want to avoid the enum:
> >
> > /*
> >  * Return the number of contiguous zeromap entries started from entry;
> >  * If all entries have consistent zeromap, *consistent will be true;
> >  * otherwise, false;
> >  */
> > static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry,
> >                 int nr, bool *consistent)
> > {
> >         struct swap_info_struct *sis = swp_swap_info(entry);
> >         unsigned long start = swp_offset(entry);
> >         unsigned long end = start + nr;
> >         unsigned long s_idx, c_idx;
> >
> >         s_idx = find_next_bit(sis->zeromap, end, start);
>
> In all of the implementations you sent, you are using find_next_bit(..,end, start), but
> I believe it should be find_next_bit(..,nr, start)?

I guess not; the tricky thing is that "size" means the size counted from
the first bit of the bitmap, not from the "start" bit.

>
> TBH, I liked the enum implementation you had in https://lore.kernel.org/all/20240905002926.1055-1-21cnbao@xxxxxxxxx/
> It's the easiest to review and understand, and least likely to introduce any bugs.
> But it could be a personal preference.
> The likelihood of having contiguous zeromap entries *that* is less than nr is very low right?
> If so we could go with the enum implementation?

What about the bool implementation I sent in the last email? It seems to be
the simplest code.

>
> >         if (s_idx == end) {
> >                 *consistent = true;
> >                 return 0;
> >         }
> >
> >         c_idx = find_next_zero_bit(sis->zeromap, end, start);
> >         if (c_idx == end) {
> >                 *consistent = true;
> >                 return nr;
> >         }
> >
> >         *consistent = false;
> >         if (s_idx == start)
> >                 return 0;
> >         return c_idx - s_idx;
> > }
> >
> > I can actually switch the places of the "consistent" and returned
> > number if that looks better.
> >
> >>
> >> Hi Usama,
> >> what is your take on this?
> >>
> >>>
> >>>>
> >>>> Though I am not sure how cheap zswap can implement it,
> >>>> swap_zeromap_entries_check() could be two simple bit operations:
> >>>>
> >>>> +static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t entry, int nr)
> >>>> +{
> >>>> +       struct swap_info_struct *sis = swp_swap_info(entry);
> >>>> +       unsigned long start = swp_offset(entry);
> >>>> +       unsigned long end = start + nr;
> >>>> +
> >>>> +       if (find_next_bit(sis->zeromap, end, start) == end)
> >>>> +               return SWAP_ZEROMAP_NON;
> >>>> +       if (find_next_zero_bit(sis->zeromap, end, start) == end)
> >>>> +               return SWAP_ZEROMAP_FULL;
> >>>> +
> >>>> +       return SWAP_ZEROMAP_PARTIAL;
> >>>> +}
> >>>>
> >>>> 3. swapcache is different from zeromap and zswap. Swapcache indicates
> >>>> that the memory is still available and should be re-mapped rather than
> >>>> allocating a new folio. Our previous patchset has implemented a full
> >>>> re-map of an mTHP in do_swap_page() as mentioned in 1.
> >>>>
> >>>> For the same reason as point 1, partial swapcache is a rare edge case.
> >>>> Not re-mapping it and instead allocating a new folio would add
> >>>> significant complexity.
> >>>>
> >>>>>>
> >>>>>> Nonetheless, `zeromap` and `zswap` are distinct cases. With `zeromap`, we
> >>>>>> permit almost all mTHP swap-ins, except for those rare situations where
> >>>>>> small folios that were swapped out happen to have contiguous and aligned
> >>>>>> swap slots.
> >>>>>>
> >>>>>> swapcache is another quite different story, since our user scenarios begin from
> >>>>>> the simplest sync io on mobile phones, we don't quite care about swapcache.
> >>>>>
> >>>>> Right. The reason I bring this up is as I mentioned above, there is a
> >>>>> common problem of forming large folios from different sources, which
> >>>>> includes the swap cache.
> >>>>> The fact that synchronous swapin does not use
> >>>>> the swapcache was a happy coincidence for you, as you can add support
> >>>>> mTHP swapins without handling this case yet ;)
> >>>>
> >>>> As I mentioned above, I'd really rather filter out those corner cases
> >>>> than support them, not just for the current situation to unlock
> >>>> swap-in series :-)
> >>>
> >>> If they are indeed corner cases, then I definitely agree.
> >>

Thanks
Barry
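
A note on the find_next_bit(.., end, start) versus find_next_bit(.., nr, start)
question above, as a rough userspace sketch rather than kernel code. The helpers
below (sketch_test_bit, sketch_find_next, sketch_zeromap_batch) are hypothetical
names written only for illustration. They assume what the thread already states:
the size argument is an absolute bit index counted from bit 0, and size is
returned when no matching bit is found. The last helper shows one possible way to
return a count starting at 'entry' while still letting the caller detect a mixed
range; it is not the implementation proposed in this thread.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical userspace helper: test one bit of a bitmap. */
static bool sketch_test_bit(const unsigned long *map, unsigned long idx)
{
	return (map[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1UL;
}

/*
 * Naive stand-in for find_next_bit()/find_next_zero_bit(): scan the
 * half-open range [offset, size) for the first bit equal to 'want' and
 * return 'size' if there is none. 'size' is an absolute bit index
 * measured from bit 0 of the bitmap, not a length relative to 'offset'.
 */
static unsigned long sketch_find_next(const unsigned long *map,
				      unsigned long size,
				      unsigned long offset, bool want)
{
	unsigned long i;

	for (i = offset; i < size; i++)
		if (sketch_test_bit(map, i) == want)
			return i;
	return size;
}

/*
 * Assumed API shape, for illustration only: report whether the first
 * entry is set in the bitmap, and return how many entries starting at
 * 'start' share that state. A result smaller than 'nr' means the range
 * is mixed.
 */
static unsigned int sketch_zeromap_batch(const unsigned long *map,
					 unsigned long start,
					 unsigned int nr, bool *is_zeromap)
{
	unsigned long end = start + nr;
	bool first;

	assert(nr > 0);
	first = sketch_test_bit(map, start);
	*is_zeromap = first;
	/* Look for the first bit whose state differs from the first entry. */
	return (unsigned int)(sketch_find_next(map, end, start, !first) - start);
}
```

In this sketch a caller would compare the returned count with nr: equal means
the whole range is consistent, smaller means the range is mixed and the count
bounds the order that can be swapped in from a single source. Passing nr instead
of start + nr as the size would leave the scan range empty whenever start >= nr,
which is why the quoted code passes 'end'.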