Re: [RFC PATCH v3 3/4] mm: add do_set_pte_range()

David Hildenbrand <david@xxxxxxxxxx> · Fri, 3 Feb 2023 14:34:07 +0100

On 03.02.23 14:30, Yin, Fengwei wrote:


On 2/3/2023 9:25 PM, David Hildenbrand wrote:
On 03.02.23 14:16, Yin Fengwei wrote:
do_set_pte_range() allows to setup page table entries for a
specific range. It calls folio_add_file_rmap_range() to take
advantage of batched rmap update for large folio.

Signed-off-by: Yin Fengwei <fengwei.yin@xxxxxxxxx>
---
   include/linux/mm.h |  3 +++
   mm/filemap.c       |  1 -
   mm/memory.c        | 59 ++++++++++++++++++++++++++++++----------------
   3 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6f8f41514cc..93192f04b276 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
     vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
   void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+        unsigned long addr, pte_t *pte,
+        unsigned long start, unsigned int nr);
     vm_fault_t finish_fault(struct vm_fault *vmf);
   vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
diff --git a/mm/filemap.c b/mm/filemap.c
index f444684db9f2..74046a3a0ff5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
             ref_count++;
           do_set_pte(vmf, page, addr);
-        update_mmu_cache(vma, addr, vmf->pte);
       } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
         /* Restore the vmf->pte */
diff --git a/mm/memory.c b/mm/memory.c
index 7a04a1130ec1..3754b2ef166a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4257,36 +4257,58 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
   }
   #endif
   -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+        unsigned long addr, pte_t *pte,
+        unsigned long start, unsigned int nr)
   {
       struct vm_area_struct *vma = vmf->vma;
       bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
       bool write = vmf->flags & FAULT_FLAG_WRITE;
+    bool cow = write && !(vma->vm_flags & VM_SHARED);
       bool prefault = vmf->address != addr;
+    struct page *page = folio_page(folio, start);
       pte_t entry;
   -    flush_icache_page(vma, page);
-    entry = mk_pte(page, vma->vm_page_prot);
+    if (!cow) {
+        folio_add_file_rmap_range(folio, start, nr, vma, false);
+        add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
+    }
   -    if (prefault && arch_wants_old_prefaulted_pte())
-        entry = pte_mkold(entry);
-    else
-        entry = pte_sw_mkyoung(entry);
+    do {
+        flush_icache_page(vma, page);
+        entry = mk_pte(page, vma->vm_page_prot);
   -    if (write)
-        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-    if (unlikely(uffd_wp))
-        entry = pte_mkuffd_wp(entry);
-    /* copy-on-write page */
-    if (write && !(vma->vm_flags & VM_SHARED)) {
+        if (prefault && arch_wants_old_prefaulted_pte())
+            entry = pte_mkold(entry);
+        else
+            entry = pte_sw_mkyoung(entry);
+
+        if (write)
+            entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+        if (unlikely(uffd_wp))
+            entry = pte_mkuffd_wp(entry);
+        set_pte_at(vma->vm_mm, addr, pte, entry);
+
+        /* no need to invalidate: a not-present page won't be cached */
+        update_mmu_cache(vma, addr, pte);
+    } while (pte++, page++, addr += PAGE_SIZE, --nr > 0);
+}
+
+void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+{
+    struct folio *folio = page_folio(page);
+    struct vm_area_struct *vma = vmf->vma;
+    bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
+            !(vma->vm_flags & VM_SHARED);
+
+    if (cow) {
           inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
           page_add_new_anon_rmap(page, vma, addr);

As raised, we cannot PTE-map a multi-page folio that way.

This function only supports single-page anon folios.

page_add_new_anon_rmap() -> folio_add_new_anon_rmap(). As that documents:

"If the folio is large, it is accounted as a THP" -- for example, we would only increment the "entire mapcount" and set the PageAnonExclusive bit only on the head page.

So this really doesn't work for multi-page folios and if the function would be used for that, we'd be in trouble.

We'd want some fence here to detect that and bail out if we'd be instructed to do that. At least a WARN_ON_ONCE() I guess.
update_mmu_tlb(vma, vmf->address, vmf->pte);

Right now the function looks like it might just handle that.
You are right. I thought moving cow case out of it can make it explicit.
But looks like it doesn't. I will add WARN_ON_ONCE(). Thanks.

I guess I would move the cow check into  do_set_pte_range() as well, and 
verify in there that we are really only dealing with a single-page 
folio, commenting that rmap code would need serious adjustment to make 
it work and that current code never passes a multi-page folio.

Thanks!

--
Thanks,

David / dhildenb