Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()

Dev Jain <dev.jain@xxxxxxx> · Wed, 18 Dec 2024 21:29:48 +0530

On 16/12/24 10:36 pm, David Hildenbrand wrote:
On 16.12.24 17:51, Dev Jain wrote:
In contrast to PMD-collapse, we do not need to operate on two levels 
of pagetable
simultaneously. Therefore, downgrade the mmap lock from write to read 
mode. Still
take the anon_vma lock in exclusive mode so as to not waste time in 
the rmap path,
which is anyways going to fail since the PTEs are going to be 
changed. Under the PTL,
copy page contents, clear the PTEs, remove folio pins, and (try to) 
unmap the
old folios. Set the PTEs to the new folio using the set_ptes() API.

Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
---
Note: I have been trying hard to get rid of the locks in here: we 
still are
taking the PTL around the page copying; dropping the PTL and taking 
it after
the copying should lead to a deadlock, for example:
khugepaged                        madvise(MADV_COLD)
folio_lock()                        lock(ptl)
lock(ptl)                        folio_lock()

We can create a locked folio list, altogether drop both the locks, 
take the PTL,
do everything which __collapse_huge_page_isolate() does *except* the 
isolation and
again try locking folios, but then it will reduce efficiency of 
khugepaged
and almost looks like a forced solution :)
Please note the following discussion if anyone is interested:
https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@xxxxxxx/ 

(Apologies for not CCing the mailing list from the start)

  mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
  1 file changed, 87 insertions(+), 21 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 88beebef773e..8040b130e677 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -714,24 +714,28 @@ static void 
__collapse_huge_page_copy_succeeded(pte_t *pte,
                          struct vm_area_struct *vma,
                          unsigned long address,
                          spinlock_t *ptl,
-                        struct list_head *compound_pagelist)
+                        struct list_head *compound_pagelist, int order)
  {
      struct folio *src, *tmp;
      pte_t *_pte;
      pte_t pteval;
  -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+    for (_pte = pte; _pte < pte + (1UL << order);
           _pte++, address += PAGE_SIZE) {
          pteval = ptep_get(_pte);
          if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
              add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
              if (is_zero_pfn(pte_pfn(pteval))) {
-                /*
-                 * ptl mostly unnecessary.
-                 */
-                spin_lock(ptl);
-                ptep_clear(vma->vm_mm, address, _pte);
-                spin_unlock(ptl);
+                if (order == HPAGE_PMD_ORDER) {
+                    /*
+                    * ptl mostly unnecessary.
+                    */
+                    spin_lock(ptl);
+                    ptep_clear(vma->vm_mm, address, _pte);
+                    spin_unlock(ptl);
+                } else {
+                    ptep_clear(vma->vm_mm, address, _pte);
+                }
                  ksm_might_unmap_zero_page(vma->vm_mm, pteval);
              }
          } else {
@@ -740,15 +744,20 @@ static void 
__collapse_huge_page_copy_succeeded(pte_t *pte,
              src = page_folio(src_page);
              if (!folio_test_large(src))
                  release_pte_folio(src);
-            /*
-             * ptl mostly unnecessary, but preempt has to
-             * be disabled to update the per-cpu stats
-             * inside folio_remove_rmap_pte().
-             */
-            spin_lock(ptl);
-            ptep_clear(vma->vm_mm, address, _pte);
-            folio_remove_rmap_pte(src, src_page, vma);
-            spin_unlock(ptl);
+            if (order == HPAGE_PMD_ORDER) {
+                /*
+                * ptl mostly unnecessary, but preempt has to
+                * be disabled to update the per-cpu stats
+                * inside folio_remove_rmap_pte().
+                */
+                spin_lock(ptl);
+                ptep_clear(vma->vm_mm, address, _pte);




+ folio_remove_rmap_pte(src, src_page, vma);
+                spin_unlock(ptl);
+            } else {
+                ptep_clear(vma->vm_mm, address, _pte);
+                folio_remove_rmap_pte(src, src_page, vma);
+            }

As I've talked to Nico about this code recently ... :)

Are you clearing the PTE after the copy succeeded? If so, where is the 
TLB flush?

How do you sync against concurrent write acess + GUP-fast?


The sequence really must be: (1) clear PTE/PMD + flush TLB (2) check 
if there are unexpected page references (e.g., GUP) if so back off (3) 
copy page content (4) set updated PTE/PMD.

To Nico, I suggested doing it simple initially, and still clear the 
high-level PMD entry + flush under mmap write lock, then re-map the 
PTE table after modifying the page table. It's not as efficient, but 
"harder to get wrong".

Btw if we are taking mmap write lock, then we do not require flushing 
PMD entry for mTHP collapse right?

Maybe that's already happening, but I stumbled over this clearing 
logic in __collapse_huge_page_copy_succeeded(), so I'm curious.