On 2024/11/9 02:04, Jann Horn wrote:
On Fri, Nov 8, 2024 at 8:13 AM Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
On 2024/11/8 07:35, Jann Horn wrote:
On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
As a first step, this commit aims to synchronously free empty PTE pages in
the madvise(MADV_DONTNEED) case. We detect and free empty PTE pages in
zap_pte_range(), and add zap_details.reclaim_pt to exclude cases other than
madvise(MADV_DONTNEED).

Once an empty PTE page is detected, we first try to take the pmd lock while
still holding the pte lock. If successful, we clear the pmd entry directly
(fast path). Otherwise, we wait until the pte lock is released, then re-take
the pmd and pte locks and loop PTRS_PER_PTE times checking pte_none() to
re-detect whether the PTE page is empty, and free it if so (slow path).
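
In pseudo-C, the fast/slow path looks roughly like the sketch below. This is
illustrative only: free_pte_page() is a hypothetical helper, and the TLB
flushing / mmu notifier work a real implementation needs is omitted.

/*
 * Sketch of the fast/slow path described above. Called from zap_pte_range()
 * with start_pte mapped and ptl held, after all PTRS_PER_PTE entries were
 * found to be pte_none(); addr is the pmd-aligned start of the range the
 * PTE table covers. Consumes (unmaps and unlocks) the pte mapping.
 */
static void reclaim_empty_pte_page(struct mm_struct *mm, pmd_t *pmd,
				   unsigned long addr, pte_t *start_pte,
				   spinlock_t *ptl)
{
	spinlock_t *pml = pmd_lockptr(mm, pmd);
	pmd_t pmdval;
	pte_t *pte;
	int i;

	/* Fast path: try to take the pmd lock while still holding the pte lock. */
	if (spin_trylock(pml)) {
		pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
		spin_unlock(pml);
		pte_unmap_unlock(start_pte, ptl);
		free_pte_page(mm, pmdval);	/* hypothetical: RCU-delayed free */
		return;
	}

	/* Slow path: drop the pte lock, then take pmd lock -> pte lock in order. */
	pte_unmap_unlock(start_pte, ptl);
	pml = pmd_lock(mm, pmd);
	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!start_pte)
		goto out;	/* the PTE page is already gone */

	/* Re-detect emptiness: the table may have been refilled while unlocked. */
	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
		if (!pte_none(ptep_get(pte)))
			goto out_unlock;
	}
	pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
	pte_unmap_unlock(start_pte, ptl);
	spin_unlock(pml);
	free_pte_page(mm, pmdval);	/* hypothetical */
	return;

out_unlock:
	pte_unmap_unlock(start_pte, ptl);
out:
	spin_unlock(pml);
}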
How does this interact with move_pages_pte()? I am looking at your
series applied on top of next-20241106, and it looks to me like
move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the
PMD entry can't change. You can clearly see this assumption at the
WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I
In move_pages_pte(), the following checks may indeed trigger:
	/* Sanity checks before the operation */
	if (WARN_ON_ONCE(pmd_none(*dst_pmd)) || WARN_ON_ONCE(pmd_none(*src_pmd)) ||
	    WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) ||
	    WARN_ON_ONCE(pmd_trans_huge(*src_pmd))) {
		err = -EINVAL;
		goto out;
	}
But maybe we can just remove the WARN_ON_ONCE(), because...
think for example move_present_pte() can end up moving a present PTE
into a page table that has already been scheduled for RCU freeing.
...this situation cannot happen. Before performing the move operation, a
pte_same() check is performed after taking the pte lock, which ensures that
the PTE page is stable:
CPU 0                                           CPU 1

                                                zap_pte_range
orig_src_pte = ptep_get(src_pte);
                                                pmd_lock
                                                pte_lock
                                                check if all PTEs are pte_none()
                                                --> clear pmd entry
                                                unlock pte
                                                unlock pmd
src_pte_lock
pte_same(orig_src_pte, ptep_get(src_pte))
--> return false and will skip the move op
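
In code, this bail-out corresponds roughly to the following (a condensed
sketch of the relevant steps, not a verbatim excerpt from move_pages_pte()):

	spin_lock(src_ptl);
	if (!pte_same(ptep_get(src_pte), orig_src_pte)) {
		/*
		 * Either the PTE changed, or the whole PTE page was zapped
		 * and scheduled for RCU freeing (reading through the stale
		 * src_pte mapping is still safe under RCU). Skip the move
		 * and let the caller retry.
		 */
		spin_unlock(src_ptl);
		err = -EAGAIN;
		goto out;
	}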
Yes, that works for the source PTE. But what about the destination? When
moving a normal present source PTE pointing to an order-0 page, and assuming
that the optimistic folio_trylock(src_folio) and
anon_vma_trylock_write(src_anon_vma) succeed, the operations on the
destination PTE in move_pages_pte() are:
dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr,
				   &dummy_pmdval, &dst_ptl)
[check that dst_pte is non-NULL]
some racy WARN_ON_ONCE() checks
spin_lock(dst_ptl);
orig_dst_pte = ptep_get(dst_pte);
spin_unlock(dst_ptl);
[bail if orig_dst_pte isn't none]
double_pt_lock(dst_ptl, src_ptl)
[check pte_same(ptep_get(dst_pte), orig_dst_pte)]
and then we're past the point of no return. Note that there is a pte_same()
check against orig_dst_pte, but orig_dst_pte is intentionally pte_none(), so
the pte_same() check does not guarantee that the destination page table is
still linked in.
OK, now I get what you mean. This is indeed a problem. In this case, it is
still necessary to recheck pmd_same() to ensure the stability of the dst_pte
page. Will fix it.
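
For illustration, the recheck might look roughly like this (a sketch, not
the exact fix; dst_pmdval stands for the pmd value that
pte_offset_map_rw_nolock() reported back, called dummy_pmdval above):

	double_pt_lock(dst_ptl, src_ptl);

	if (!pte_same(ptep_get(src_pte), orig_src_pte) ||
	    !pte_same(ptep_get(dst_pte), orig_dst_pte) ||
	    /*
	     * orig_dst_pte is intentionally pte_none(), so the pte_same()
	     * check above cannot detect a freed destination PTE page;
	     * verify that the pmd entry still points at the same table.
	     */
	    !pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd))) {
		double_pt_unlock(dst_ptl, src_ptl);
		err = -EAGAIN;	/* caller retries from the top */
		goto out;
	}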
diff --git a/mm/memory.c b/mm/memory.c
index 002aa4f454fa0..c4a8c18fbcfd7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 static inline bool should_zap_cows(struct zap_details *details)
 {
 	/* By default, zap all pages */
-	if (!details)
+	if (!details || details->reclaim_pt)
 		return true;
 
 	/* Or, we zap COWed pages only if the caller wants to */
This looks hacky - when we have a "details" object, its ->even_cows
member is supposed to indicate whether COW pages should be zapped. So
please instead set .even_cows=true in madvise_dontneed_single_vma().
But details->reclaim_pt should still be set, right? We need .reclaim_pt to
indicate whether empty PTE pages should be reclaimed.
Yeah, you should set both.
OK.
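
Concretely, something like this sketch (the agreed direction, not
necessarily the exact final patch):

static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
					unsigned long start, unsigned long end)
{
	struct zap_details details = {
		.reclaim_pt = true,	/* also free empty PTE pages */
		.even_cows = true,	/* MADV_DONTNEED zaps COWed pages too */
	};

	zap_page_range_single(vma, start, end - start, &details);
	return 0;
}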
Thanks!