On Tue, May 7, 2024 at 1:16 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 06.05.24 14:58, Barry Song wrote:
> > On Tue, May 7, 2024 at 12:38 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>
> >> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>
> >>> On 04.05.24 01:23, Barry Song wrote:
> >>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>>
> >>>>> On 03/05/2024 01:50, Barry Song wrote:
> >>>>>> From: Chuanhua Han <hanchuanhua@xxxxxxxx>
> >>>>>>
> >>>>>> When a large folio is found in the swapcache, the current implementation
> >>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> >>>>>> page faults. This patch opts to map the entire large folio at once to
> >>>>>> minimize page faults. Additionally, redundant checks and early exits
> >>>>>> for ARM64 MTE restoring are removed.
> >>>>>>
> >>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@xxxxxxxx>
> >>>>>> Co-developed-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>>
> >>>>> With the suggested changes below:
> >>>>>
> >>>>> Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> >>>>>
> >>>>>> ---
> >>>>>>  mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >>>>>>  1 file changed, 48 insertions(+), 12 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>> index 22e7c33cc747..940fdbe69fa1 100644
> >>>>>> --- a/mm/memory.c
> >>>>>> +++ b/mm/memory.c
> >>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>  	pte_t pte;
> >>>>>>  	vm_fault_t ret = 0;
> >>>>>>  	void *shadow = NULL;
> >>>>>> +	int nr_pages = 1;
> >>>>>> +	unsigned long page_idx = 0;
> >>>>>> +	unsigned long address = vmf->address;
> >>>>>> +	pte_t *ptep;
> >>>>>
> >>>>> nit: Personally I'd prefer all these to get initialised just before the "if
> >>>>> (folio_test_large()..." block below. That way it is clear they are fresh (in case
> >>>>> any logic between here and there makes an adjustment) and it's clear that they
> >>>>> are only to be used after that block (the compiler will warn if using an
> >>>>> uninitialized value).
> >>>>
> >>>> Right. I agree this will make the code more readable.
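To make that concrete, the next version could keep the declarations at the
top of do_swap_page() and move the initialisation down, along the lines of
the below (untested sketch, not the final patch):

	int nr_pages;
	unsigned long page_idx;
	unsigned long address;
	pte_t *ptep;
	...
	/*
	 * Initialise immediately before the large-folio block so the
	 * values are clearly fresh, and the compiler can warn about
	 * any accidental use before this point.
	 */
	nr_pages = 1;
	page_idx = 0;
	address = vmf->address;
	ptep = vmf->pte;
	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
		...
	}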
> >>>>
> >>>>>
> >>>>>>
> >>>>>>  	if (!pte_unmap_same(vmf))
> >>>>>>  		goto out;
> >>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>  			goto out_nomap;
> >>>>>>  	}
> >>>>>>
> >>>>>> +	ptep = vmf->pte;
> >>>>>> +	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >>>>>> +		int nr = folio_nr_pages(folio);
> >>>>>> +		unsigned long idx = folio_page_idx(folio, page);
> >>>>>> +		unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >>>>>> +		unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >>>>>> +		pte_t *folio_ptep;
> >>>>>> +		pte_t folio_pte;
> >>>>>> +
> >>>>>> +		if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >>>>>> +			goto check_folio;
> >>>>>> +		if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >>>>>> +			goto check_folio;
> >>>>>> +
> >>>>>> +		folio_ptep = vmf->pte - idx;
> >>>>>> +		folio_pte = ptep_get(folio_ptep);
> >>>>>> +		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> >>>>>> +		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> >>>>>> +			goto check_folio;
> >>>>>> +
> >>>>>> +		page_idx = idx;
> >>>>>> +		address = folio_start;
> >>>>>> +		ptep = folio_ptep;
> >>>>>> +		nr_pages = nr;
> >>>>>> +		entry = folio->swap;
> >>>>>> +		page = &folio->page;
> >>>>>> +	}
> >>>>>> +
> >>>>>> +check_folio:
> >>>>>
> >>>>> Is this still the correct label name, given the checks are now above the new
> >>>>> block? Perhaps "one_page" or something like that?
> >>>>
> >>>> Not quite sure about this, as the code after "one_page" can handle multiple
> >>>> pages. On the other hand, it seems we really are checking the folio after
> >>>> "check_folio" :-)
> >>>>
> >>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> >>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> >>>>
> >>>> /*
> >>>>  * Check under PT lock (to protect against concurrent fork() sharing
> >>>>  * the swap entry concurrently) for certainly exclusive pages.
> >>>>  */
> >>>> if (!folio_test_ksm(folio)) {
> >>>>
> >>>>>
> >>>>>> +
> >>>>>>  	/*
> >>>>>>  	 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >>>>>>  	 * must never point at an anonymous page in the swapcache that is
> >>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>  	 * We're already holding a reference on the page but haven't mapped it
> >>>>>>  	 * yet.
> >>>>>>  	 */
> >>>>>> -	swap_free_nr(entry, 1);
> >>>>>> +	swap_free_nr(entry, nr_pages);
> >>>>>>  	if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>>>>>  		folio_free_swap(folio);
> >>>>>>
> >>>>>> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>>>>> -	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>>>>> +	folio_ref_add(folio, nr_pages - 1);
> >>>>>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>>>>> +	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>>>>>  	pte = mk_pte(page, vma->vm_page_prot);
> >>>>>>
> >>>>>>  	/*
> >>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>  	 * exclusivity.
> >>>>>>  	 */
> >>>>>>  	if (!folio_test_ksm(folio) &&
> >>>>>> -	    (exclusive || folio_ref_count(folio) == 1)) {
> >>>>>> +	    (exclusive || (folio_ref_count(folio) == nr_pages &&
> >>>>>> +			   folio_nr_pages(folio) == nr_pages))) {
> >>>>>
> >>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
> >>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
> >>>>> robustness' sake. Just checking my understanding.
> >>>>
> >>>> This is the code showing we are reusing (mkwrite) a folio only when either:
> >>>> 1. we meet a small folio and we are the only one hitting the small folio
> >>>> 2. we meet a large folio and we are the only one hitting the large folio
> >>>>
> >>>> Any corner cases besides the above two seem difficult. For example,
> >>>> if we hit a large folio in the swapcache but can't map it entirely
> >>>> (nr_pages == 1) due to a partial unmap, we would have
> >>>> folio_ref_count(folio) == nr_pages == 1.
> >>>
> >>> No, there would be other references from the swapcache and
> >>> folio_ref_count(folio) > 1. See my other reply.
> >>
> >> right. can be clearer by:
> >
> > Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
> > folio_ref_count(folio) == 1 and move folio_ref_add(folio, nr_pages - 1)?
>
> I don't think that we will "need" it.
>
> > One case is that we have a large folio with 16 PTEs and we unmap 15 swap
> > PTE entries, so we have only one swap entry left. Then we hit the large
> > folio in the swapcache, but we have only one PTE, so we map only one PTE.
> > Lacking folio_nr_pages(folio) == nr_pages, we reuse the large folio for
> > one PTE. With it, do_wp_page() will migrate the large folio to a small one?
>
> We will set the PAE bit and do_wp_page() will unconditionally reuse that page.
>
> Note that this is the same as if we had pte_swp_exclusive() set and
> would have run into "exclusive=true" here.
>
> If we'd want a similar "optimization" as we have in
> wp_can_reuse_anon_folio(), you'd want something like
>
> exclusive || (folio_ref_count(folio) == 1 &&
> 	      (!folio_test_large(folio) || nr_pages > 1))

I feel like

A: !folio_test_large(folio) || nr_pages > 1

equals

B: folio_nr_pages(folio) == nr_pages

If the folio is small, folio_test_large(folio) is false, so both A and B
are true.

If the folio is large and we map the whole large folio, A is true because
nr_pages > 1; B is also true.

If the folio is large and we map a single PTE, A is false; B is also
false, because nr_pages == 1 while folio_nr_pages(folio) > 1.

Right? (A small userspace check at the end of this mail walks through the
three cases.) However, I agree that delving into this complexity might not
be necessary at the moment.

> ... but I am not sure if that is really worth the complexity here.
>
> > 1AM, tired and sleepy. Not quite sure I am correct.
> > I look forward to seeing your reply tomorrow morning :-)
>
> Heh, no need to dream about this ;)
>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry
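P.S. Here is the promised check of the A == B claim. folio_nr and nr_pages
are hypothetical stand-ins for folio_nr_pages(folio) and nr_pages; the only
assumption encoded is that do_swap_page() maps either a single PTE or the
whole folio, so nr_pages is either 1 or folio_nr_pages(folio):

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* A: !folio_test_large(folio) || nr_pages > 1 */
static bool cond_a(long folio_nr, long nr_pages)
{
	bool is_large = folio_nr > 1;	/* stand-in for folio_test_large() */

	return !is_large || nr_pages > 1;
}

/* B: folio_nr_pages(folio) == nr_pages */
static bool cond_b(long folio_nr, long nr_pages)
{
	return folio_nr == nr_pages;
}

int main(void)
{
	/* nr_pages is either 1 or folio_nr_pages(folio) */
	long cases[][2] = {
		{  1,  1 },	/* small folio */
		{ 16, 16 },	/* large folio, mapped entirely */
		{ 16,  1 },	/* large folio, single PTE after partial unmap */
	};

	for (int i = 0; i < 3; i++) {
		long folio_nr = cases[i][0], nr = cases[i][1];

		printf("folio_nr=%2ld nr_pages=%2ld A=%d B=%d\n",
		       folio_nr, nr, cond_a(folio_nr, nr), cond_b(folio_nr, nr));
		assert(cond_a(folio_nr, nr) == cond_b(folio_nr, nr));
	}
	return 0;
}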