On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 04.05.24 01:23, Barry Song wrote:
> > On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>
> >> On 03/05/2024 01:50, Barry Song wrote:
> >>> From: Chuanhua Han <hanchuanhua@xxxxxxxx>
> >>>
> >>> When a large folio is found in the swapcache, the current implementation
> >>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> >>> page faults. This patch opts to map the entire large folio at once to
> >>> minimize page faults. Additionally, redundant checks and early exits
> >>> for ARM64 MTE restoring are removed.
> >>>
> >>> Signed-off-by: Chuanhua Han <hanchuanhua@xxxxxxxx>
> >>> Co-developed-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>
> >> With the suggested changes below:
> >>
> >> Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> >>
> >>> ---
> >>>  mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >>>  1 file changed, 48 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 22e7c33cc747..940fdbe69fa1 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>       pte_t pte;
> >>>       vm_fault_t ret = 0;
> >>>       void *shadow = NULL;
> >>> +     int nr_pages = 1;
> >>> +     unsigned long page_idx = 0;
> >>> +     unsigned long address = vmf->address;
> >>> +     pte_t *ptep;
> >>
> >> nit: Personally I'd prefer all these to get initialised just before the "if
> >> (folio_test_large()..." block below. That way it is clear they are fresh (incase
> >> any logic between here and there makes an adjustment) and its clear that they
> >> are only to be used after that block (the compiler will warn if using an
> >> uninitialized value).
> >
> > right. I agree this will make the code more readable.
> >
> >>
> >>>
> >>>       if (!pte_unmap_same(vmf))
> >>>               goto out;
> >>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>               goto out_nomap;
> >>>       }
> >>>
> >>> +     ptep = vmf->pte;
> >>> +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >>> +             int nr = folio_nr_pages(folio);
> >>> +             unsigned long idx = folio_page_idx(folio, page);
> >>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >>> +             pte_t *folio_ptep;
> >>> +             pte_t folio_pte;
> >>> +
> >>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >>> +                     goto check_folio;
> >>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >>> +                     goto check_folio;
> >>> +
> >>> +             folio_ptep = vmf->pte - idx;
> >>> +             folio_pte = ptep_get(folio_ptep);
> >>> +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> >>> +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> >>> +                     goto check_folio;
> >>> +
> >>> +             page_idx = idx;
> >>> +             address = folio_start;
> >>> +             ptep = folio_ptep;
> >>> +             nr_pages = nr;
> >>> +             entry = folio->swap;
> >>> +             page = &folio->page;
> >>> +     }
> >>> +
> >>> +check_folio:
> >>
> >> Is this still the correct label name, given the checks are now above the new
> >> block? Perhaps "one_page" or something like that?
> >
> > not quite sure about this, as the code after one_page can be multiple_pages.
> > On the other hand, it seems we are really checking folio after "check_folio"
> > :-)
> >
> >
> > BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> > BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> >
> > /*
> >  * Check under PT lock (to protect against concurrent fork() sharing
> >  * the swap entry concurrently) for certainly exclusive pages.
> >  */
> > if (!folio_test_ksm(folio)) {
> >
> >
> >>
> >>> +
> >>>       /*
> >>>        * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >>>        * must never point at an anonymous page in the swapcache that is
> >>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>        * We're already holding a reference on the page but haven't mapped it
> >>>        * yet.
> >>>        */
> >>> -     swap_free_nr(entry, 1);
> >>> +     swap_free_nr(entry, nr_pages);
> >>>       if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>>               folio_free_swap(folio);
> >>>
> >>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>> +     folio_ref_add(folio, nr_pages - 1);
> >>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>>       pte = mk_pte(page, vma->vm_page_prot);
> >>>
> >>>       /*
> >>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>        * exclusivity.
> >>>        */
> >>>       if (!folio_test_ksm(folio) &&
> >>> -         (exclusive || folio_ref_count(folio) == 1)) {
> >>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
> >>> +                        folio_nr_pages(folio) == nr_pages))) {
> >>
> >> I think in practice there is no change here? If nr_pages > 1 then the folio is
> >> in the swapcache, so there is an extra ref on it? I agree with the change for
> >> robustness sake. Just checking my understanding.
> >
> > This is the code showing we are reusing/(mkwrite) a folio either
> > 1. we meet a small folio and we are the only one hitting the small folio
> > 2. we meet a large folio and we are the only one hitting the large folio
> >
> > any corner cases besides the above two seems difficult. for example,
> >
> > while we hit a large folio in swapcache but if we can't entirely map it
> > (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> > == nr_pages == 1
>
> No, there would be other references from the swapcache and
> folio_ref_count(folio) > 1. See my other reply.

right. can be clearer by:

@@ -4263,7 +4264,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
        if (should_try_to_free_swap(folio, vma, vmf->flags))
                folio_free_swap(folio);

-       folio_ref_add(folio, nr_pages - 1);
        add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
        add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
        pte = mk_pte(page, vma->vm_page_prot);
@@ -4275,14 +4275,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
         * exclusivity.
         */
        if (!folio_test_ksm(folio) &&
-           (exclusive || (folio_ref_count(folio) == nr_pages &&
-                          folio_nr_pages(folio) == nr_pages))) {
+           (exclusive || folio_ref_count(folio) == 1)) {
                if (vmf->flags & FAULT_FLAG_WRITE) {
                        pte = maybe_mkwrite(pte_mkdirty(pte), vma);
                        vmf->flags &= ~FAULT_FLAG_WRITE;
                }
                rmap_flags |= RMAP_EXCLUSIVE;
        }
+       folio_ref_add(folio, nr_pages - 1);
        flush_icache_pages(vma, page, nr_pages);
        if (pte_swp_soft_dirty(vmf->orig_pte))
                pte = pte_mksoft_dirty(pte);

>
> --
> Cheers,
>
> David / dhildenb
>
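To spell out the accounting the reordering above relies on, here is a rough
sketch; the values assume the common case of a fully mapped large folio whose
swapcache reference has already been dropped by folio_free_swap(), with no
other users:

        /*
         * lookup reference held by this fault        -> folio_ref_count() == 1
         * exclusive || folio_ref_count(folio) == 1   -> passes, RMAP_EXCLUSIVE set
         * folio_ref_add(folio, nr_pages - 1)         -> folio_ref_count() == nr_pages,
         *                                               one reference per PTE we are
         *                                               about to install
         */

Taking the extra nr_pages - 1 references only after the exclusivity test keeps
the original "folio_ref_count(folio) == 1" condition working unchanged for both
small and large folios.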
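And, as a side note on the folio_start/folio_end check earlier in the patch, a
rough worked example; the addresses are made up purely for illustration (4KiB
pages, a 16-page folio):

        /*
         * vmf->address = 0x7f0000005000, idx = 5, nr = 16
         * folio_start  = 0x7f0000005000 - 5 * 0x1000  = 0x7f0000000000
         * folio_end    = folio_start + 16 * 0x1000    = 0x7f0000010000
         *
         * Batching is attempted only when [folio_start, folio_end) lies fully
         * inside both the current PTE table and the VMA, i.e.
         *   folio_start >= max(vmf->address & PMD_MASK, vma->vm_start) &&
         *   folio_end   <= pmd_addr_end(vmf->address, vma->vm_end),
         * otherwise we take "goto check_folio" and map a single page as before.
         */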