Re: [PATCH v4 1/2] mm: store zero pages to be swapped out in a bitmap

Usama Arif <usamaarif642@xxxxxxxxx> · Thu, 5 Sep 2024 11:37:00 +0100

On 05/09/2024 11:10, Barry Song wrote:
> On Thu, Sep 5, 2024 at 8:49 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>>
>> On Thu, Sep 5, 2024 at 7:55 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>>>
>>> On Thu, Sep 5, 2024 at 12:03 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
>>>>
>>>> On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>>>>>
>>>>> [..]
>>>>>>> I understand the point of doing this to unblock the synchronous large
>>>>>>> folio swapin support work, but at some point we're gonna have to
>>>>>>> actually handle the cases where a large folio being swapped in is
>>>>>>> partially in the swap cache, zswap, the zeromap, etc.
>>>>>>>
>>>>>>> All these cases will need similar-ish handling, and I suspect we won't
>>>>>>> just skip swapping in large folios in all these cases.
>>>>>>
>>>>>> I agree that this is definitely the goal. `swap_read_folio()` should be a
>>>>>> dependable API that always returns reliable data, regardless of whether
>>>>>> `zeromap` or `zswap` is involved. Despite these issues, mTHP swap-in shouldn't
>>>>>> be held back. Significant efforts are underway to support large folios in
>>>>>> `zswap`, and progress is being made. Not to mention we've already allowed
>>>>>> `zeromap` to proceed, even though it doesn't support large folios.
>>>>>>
>>>>>> It's genuinely unfair to let the lack of mTHP support in `zeromap` and
>>>>>> `zswap` hold swap-in hostage.
>>>>>
>>>>
>>>> Hi Yosry,
>>>>
>>>>> Well, two points here:
>>>>>
>>>>> 1. I did not say that we should block the synchronous mTHP swapin work
>>>>> for this :) I said the next item on the TODO list for mTHP swapin
>>>>> support should be handling these cases.
>>>>
>>>> Thanks for your clarification!
>>>>
>>>>>
>>>>> 2. I think two things are getting conflated here. Zswap needs to
>>>>> support mTHP swapin*. Zeromap already supports mTHPs AFAICT. What is
>>>>> truly missing, and is outside the scope of zswap/zeromap, is being
>>>>> able to support hybrid mTHP swapin.
>>>>>
>>>>> When swapping in an mTHP, the swapped entries can be on disk, in the
>>>>> swapcache, in zswap, or in the zeromap. Even if all these things
>>>>> support mTHPs individually, we essentially need support to form an
>>>>> mTHP from swap entries in different backends. That's what I meant.
>>>>> Actually if we have that, we may not really need mTHP swapin support
>>>>> in zswap, because we can just form the large folio in the swap layer
>>>>> from multiple zswap entries.
>>>>>
>>>>
>>>> After further consideration, I've actually started to disagree with the idea
>>>> of supporting hybrid swapin (forming an mTHP from swap entries in different
>>>> backends). My reasoning is as follows:
>>>
>>> I do not have any data about this, so you could very well be right
>>> here. Handling hybrid swapin could be simply falling back to the
>>> smallest order we can swapin from a single backend. We can at least
>>> start with this, and collect data about how many mTHP swapins fall back
>>> due to hybrid backends. This way we only take the complexity if
>>> needed.
>>>
>>> I did imagine though that it's possible for two virtually contiguous
>>> folios to be swapped out to contiguous swap entries and end up in
>>> different media (e.g. if only one of them is zero-filled). I am not
>>> sure how rare it would be in practice.
>>>
>>>>
>>>> 1. The scenario where an mTHP is partially zeromap, partially zswap, etc.,
>>>> would be an extremely rare case, as long as we're swapping out the mTHP as
>>>> a whole and all the modules are handling it accordingly. It's highly
>>>> unlikely to form this mix of zeromap, zswap, and swapcache unless the
>>>> contiguous VMA virtual address happens to get some small folios with
>>>> aligned and contiguous swap slots. Even then, they would need to be
>>>> partially zeromap and partially non-zeromap, zswap, etc.
>>>
>>> As I mentioned, we can start simple and collect data for this. If it's
>>> rare and we don't need to handle it, that's good.
>>>
>>>>
>>>> As you mentioned, zeromap handles mTHP as a whole during swapping
>>>> out, marking all subpages of the entire mTHP as zeromap rather than just
>>>> a subset of them.
>>>>
>>>> And swap-in can also entirely map a swapcache which is a large folio based
>>>> on our previous patchset which has been in mainline:
>>>> "mm: swap: entirely map large folios found in swapcache"
>>>> https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@xxxxxxxxx/
>>>>
>>>> It seems the only thing we're missing is zswap support for mTHP.
>>>
>>> It is still possible for two virtually contiguous folios to be swapped
>>> out to contiguous swap entries. It is also possible that a large folio
>>> is swapped out as a whole, then only a part of it is swapped in later
>>> due to memory pressure. If that part is later reclaimed again and gets
>>> added to the swapcache, we can run into the hybrid swapin situation.
>>> There may be other scenarios as well, I did not think this through.
>>>
>>>>
>>>> 2. Implementing hybrid swap-in would be extremely tricky and could disrupt
>>>> several software layers. I can share some pseudo code below:
>>>
>>> Yeah it definitely would be complex, so we need proper justification for it.
>>>
>>>>
>>>> swap_read_folio()
>>>> {
>>>>         if (zeromap_full)
>>>>                 folio_read_from_zeromap()
>>>>         else if (zswap_map_full)
>>>>                 folio_read_from_zswap()
>>>>         else {
>>>>                 folio_read_from_swapfile()
>>>>                 /* fill zero for partially zeromap subpages */
>>>>                 if (zeromap_partial)
>>>>                         folio_read_from_zeromap_fixup()
>>>>                 /* zswap_load for partially zswap-mapped subpages */
>>>>                 if (zswap_partial)
>>>>                         folio_read_from_zswap_fixup()
>>>>
>>>>                 folio_mark_uptodate()
>>>>                 folio_unlock()
>>>>         }
>>>> }
>>>>
>>>> We'd also need to modify folio_read_from_swapfile() to skip
>>>> folio_mark_uptodate() and folio_unlock() after completing the BIO. This
>>>> approach seems to entirely disrupt the software layers.
>>>>
>>>> This could also lead to unnecessary IO operations for subpages that
>>>> require fixup. Since such cases are quite rare, I believe the added
>>>> complexity isn't worth it.
>>>>
>>>> My point is that we should simply check that all PTEs have consistent zeromap,
>>>> zswap, and swapcache statuses before proceeding, otherwise fall back to the next
>>>> lower order if needed. This approach improves performance and avoids complex
>>>> corner cases.
>>>
>>> Agree that we should start with that, although we should probably
>>> fall back to the largest order we can swapin from a single backend,
>>> rather than the next lower order.
>>>
>>>>
>>>> So once zswap mTHP is there, I would also expect an API similar to
>>>> swap_zeromap_entries_check(), for example
>>>> zswap_entries_check(entry, nr), which can return whether we have
>>>> full, none, or partial zswap, to replace the existing
>>>> zswap_never_enabled().
>>>
>>> I think a better API would be similar to what Usama had. Basically
>>> take in (entry, nr) and return how much of it is in zswap starting at
>>> entry, so that we can decide the swapin order.
>>>
>>> Maybe we can adjust your proposed swap_zeromap_entries_check() as well
>>> to do that? Basically return the number of swap entries in the zeromap
>>> starting at 'entry'. If 'entry' itself is not in the zeromap we return
>>> 0 naturally. That would be a small adjustment/fix over what Usama had,
>>> but implementing it with bitmap operations like you did would be
>>> better.
>>
>> I assume you mean the below:
>>
>> /*
>>  * Return the number of contiguous zeromap entries starting from entry
>>  */
>> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry, int nr)
>> {
>>         struct swap_info_struct *sis = swp_swap_info(entry);
>>         unsigned long start = swp_offset(entry);
>>         unsigned long end = start + nr;
>>         unsigned long idx;
>>
>>         idx = find_next_bit(sis->zeromap, end, start);
>>         if (idx != start)
>>                 return 0;
>>
>>         return find_next_zero_bit(sis->zeromap, end, start) - idx;
>> }
>>
>> If yes, I really like this idea.
>>
>> It seems much better than using an enum, which would require adding a new
>> data structure :-) Additionally, returning the number allows callers to
>> fall back to the largest possible order, rather than trying next lower
>> orders sequentially.
> 
> No, returning 0 after only checking first entry would still reintroduce
> the current bug, where the start entry is zeromap but other entries
> might not be. We need another value to indicate whether the entries
> are consistent if we want to avoid the enum:
> 
> /*
>  * Return the number of contiguous zeromap entries starting from entry.
>  * If all entries have a consistent zeromap status, *consistent is set to
>  * true; otherwise it is set to false.
>  */
> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry,
>                 int nr, bool *consistent)
> {
>         struct swap_info_struct *sis = swp_swap_info(entry);
>         unsigned long start = swp_offset(entry);
>         unsigned long end = start + nr;
>         unsigned long s_idx, c_idx;
> 
>         s_idx = find_next_bit(sis->zeromap, end, start);

In all of the implementations you sent, you are using find_next_bit(..,end, start), but
I believe it should be find_next_bit(..,nr, start)?

TBH, I liked the enum implementation you had in https://lore.kernel.org/all/20240905002926.1055-1-21cnbao@xxxxxxxxx/
It's the easiest to review and understand, and the least likely to introduce any bugs.
But it could be a personal preference.
The likelihood of having a run of contiguous zeromap entries that is shorter than nr is very low, right?
If so, could we go with the enum implementation?
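
For anyone following along, here is a minimal user-space sketch of the three-way check from the enum version. It is not the kernel code: the `sketch_*` helpers, `bit_is_set()` and the `ZEROMAP_*` names are simplified stand-ins made up for illustration, and the zeromap is just a plain array of longs. The helpers take an exclusive upper bound on the bit index as their second argument and return that bound when nothing is found, which is how the `size` parameter of the kernel's find_next_bit() is documented as far as I can tell. Under that reading, `end` is the right bound and `nr` would cut the scan short whenever start is not 0, but please double-check.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* User-space stand-in: test one bit of a plain bitmap. */
static int bit_is_set(const unsigned long *map, unsigned long bit)
{
        return (map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1UL;
}

/*
 * Simplified stand-ins for the kernel helpers (assumption: 'size' is an
 * exclusive upper bound on the bit index, returned when nothing is found).
 */
static unsigned long sketch_find_next_bit(const unsigned long *map,
                                          unsigned long size,
                                          unsigned long offset)
{
        while (offset < size && !bit_is_set(map, offset))
                offset++;
        return offset < size ? offset : size;
}

static unsigned long sketch_find_next_zero_bit(const unsigned long *map,
                                               unsigned long size,
                                               unsigned long offset)
{
        while (offset < size && bit_is_set(map, offset))
                offset++;
        return offset < size ? offset : size;
}

/* Hypothetical names mirroring the NON/FULL/PARTIAL idea in the thread. */
enum zeromap_stat { ZEROMAP_NON, ZEROMAP_FULL, ZEROMAP_PARTIAL };

/* Classify entries [start, start + nr) with two bitmap scans. */
static enum zeromap_stat sketch_zeromap_check(const unsigned long *zeromap,
                                              unsigned long start,
                                              unsigned long nr)
{
        unsigned long end = start + nr;

        assert(nr > 0);
        if (sketch_find_next_bit(zeromap, end, start) == end)
                return ZEROMAP_NON;     /* no entry is zero-filled */
        if (sketch_find_next_zero_bit(zeromap, end, start) == end)
                return ZEROMAP_FULL;    /* every entry is zero-filled */
        return ZEROMAP_PARTIAL;         /* mixed: caller falls back */
}
```

With a bitmap where only bits 4..7 are set, the range starting at 0 with nr = 4 classifies as NON, the range starting at 4 as FULL, and the range starting at 2 as PARTIAL.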

>         if (s_idx == end) {
>                 *consistent = true;
>                 return 0;
>         }
> 
>         c_idx = find_next_zero_bit(sis->zeromap, end, start);
>         if (c_idx == end) {
>                 *consistent = true;
>                 return nr;
>         }
> 
>         *consistent = false;
>         if (s_idx != start)     /* first entry is not in the zeromap */
>                 return 0;
>         return c_idx - s_idx;
> }
> 
> I can actually switch the places of the "consistent" and returned number
> if that looks better.
> 
>>
>> Hi Usama,
>> what is your take on this?
>>
>>>
>>>>
>>>> Though I am not sure how cheaply zswap can implement it,
>>>> swap_zeromap_entries_check() could be two simple bit operations:
>>>>
>>>> +static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t entry,
>>>> +                                                       int nr)
>>>> +{
>>>> +       struct swap_info_struct *sis = swp_swap_info(entry);
>>>> +       unsigned long start = swp_offset(entry);
>>>> +       unsigned long end = start + nr;
>>>> +
>>>> +       if (find_next_bit(sis->zeromap, end, start) == end)
>>>> +               return SWAP_ZEROMAP_NON;
>>>> +       if (find_next_zero_bit(sis->zeromap, end, start) == end)
>>>> +               return SWAP_ZEROMAP_FULL;
>>>> +
>>>> +       return SWAP_ZEROMAP_PARTIAL;
>>>> +}
>>>>
>>>> 3. swapcache is different from zeromap and zswap. Swapcache indicates
>>>> that the memory is still available and should be re-mapped rather than
>>>> allocating a new folio. Our previous patchset has implemented a full
>>>> re-map of an mTHP in do_swap_page() as mentioned in 1.
>>>>
>>>> For the same reason as point 1, partial swapcache is a rare edge case.
>>>> Not re-mapping it and instead allocating a new folio would add
>>>> significant complexity.
>>>>
>>>>>>
>>>>>> Nonetheless, `zeromap` and `zswap` are distinct cases. With `zeromap`, we
>>>>>> permit almost all mTHP swap-ins, except for those rare situations where
>>>>>> small folios that were swapped out happen to have contiguous and aligned
>>>>>> swap slots.
>>>>>>
>>>>>> swapcache is another quite different story, since our user scenarios begin from
>>>>>> the simplest sync io on mobile phones, we don't quite care about swapcache.
>>>>>
>>>>> Right. The reason I bring this up is as I mentioned above, there is a
>>>>> common problem of forming large folios from different sources, which
>>>>> includes the swap cache. The fact that synchronous swapin does not use
>>>>> the swapcache was a happy coincidence for you, as you can add support
>>>>> for mTHP swapins without handling this case yet ;)
>>>>
>>>> As I mentioned above, I'd really rather filter out those corner cases
>>>> than support them, not just for the current situation to unlock the
>>>> swap-in series :-)
>>>
>>> If they are indeed corner cases, then I definitely agree.
>>
>> Thanks
>> Barry
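
To make the count-based alternative discussed above concrete (return how many entries starting at 'entry' share one status, so the caller can pick the largest usable order instead of stepping down one order at a time), here is a rough user-space sketch. All names (`zeromap_bit()`, `uniform_run()`, `zeromap_count()`, `largest_order()`) are hypothetical, it uses a plain loop rather than the bitmap helpers, and it ignores the alignment of the faulting address, so treat it as an illustration of the idea only.

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* User-space stand-in: is this swap offset marked in the zeromap? */
static bool zeromap_bit(const unsigned long *map, unsigned long bit)
{
        return (map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1UL;
}

/*
 * Number of leading entries in [start, start + nr) that share the zeromap
 * status of the first entry. Assumes nr >= 1.
 */
static unsigned int uniform_run(const unsigned long *zeromap,
                                unsigned long start, unsigned int nr)
{
        bool first = zeromap_bit(zeromap, start);
        unsigned int run = 1;

        while (run < nr && zeromap_bit(zeromap, start + run) == first)
                run++;
        return run;
}

/*
 * Contiguous zeromap entries starting at 'start' (0 if the first entry is
 * not in the zeromap). *consistent tells whether all nr entries agree.
 */
static unsigned int zeromap_count(const unsigned long *zeromap,
                                  unsigned long start, unsigned int nr,
                                  bool *consistent)
{
        unsigned int run = uniform_run(zeromap, start, nr);

        *consistent = (run == nr);
        return zeromap_bit(zeromap, start) ? run : 0;
}

/* Largest order such that 2^order entries fit in a uniform run. */
static unsigned int largest_order(unsigned int run)
{
        unsigned int order = 0;

        while (order + 1 < 32 && (1u << (order + 1)) <= run)
                order++;
        return order;
}
```

A caller could feed the result of uniform_run() into largest_order() to choose the swapin order directly. Whether that extra flexibility is worth it over the enum depends on how often partial runs show up in practice, which is the data nobody has yet.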