Hi Jann,
On 2024/11/7 05:48, Jann Horn wrote:
> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
>> In retract_page_tables(), we may modify the pmd entry after acquiring the
>> pml and ptl, so we should also check whether the pmd entry is stable.
>
> Why does taking the PMD lock not guarantee that the PMD entry is stable?

Because the pmd entry may have changed before we took the pmd lock, we
need to recheck it after taking the pmd lock or the pte lock (see the
race example below).

>> Use pte_offset_map_rw_nolock() + pmd_same() to do this, and then we
>> can also remove the call to pte_lockptr().
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
>> ---
>>  mm/khugepaged.c | 17 ++++++++++++++++-
>>  1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 6f8d46d107b4b..6d76dde64f5fb 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>>  		spinlock_t *pml;
>>  		spinlock_t *ptl;
>>  		bool skipped_uffd = false;
>> +		pte_t *pte;
>>
>>  		/*
>>  		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
>> @@ -1756,11 +1757,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>>  					addr, addr + HPAGE_PMD_SIZE);
>>  		mmu_notifier_invalidate_range_start(&range);
>>
>> +		pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
>> +		if (!pte) {
>> +			mmu_notifier_invalidate_range_end(&range);
>> +			continue;
>> +		}
>> +
>>  		pml = pmd_lock(mm, pmd);

> I don't understand why you're mapping the page table before locking
> the PMD. Doesn't that just mean we need more error checking
> afterwards?

The main purpose is to obtain the pmdval. If we don't use
pte_offset_map_rw_nolock(), we have to take care to recheck the pmd
entry before calling pte_lockptr(), like this:

	pmdval = pmdp_get_lockless(pmd);
	pmd_lock(mm, pmd);
	/* recheck pmdval */
	pte_lockptr(mm, pmd);
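
Fleshed out, a minimal sketch of that alternative ordering could look
like the following (the bail-out label is hypothetical, not something
taken from the patch):

	pmd_t pmdval;
	spinlock_t *pml, *ptl;

	pmdval = pmdp_get_lockless(pmd);	/* sampled with no lock held */
	pml = pmd_lock(mm, pmd);
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		/* the pmd entry changed before we took pml; bail out */
		spin_unlock(pml);
		goto next;			/* hypothetical label */
	}
	/* only now is it safe to trust *pmd to locate the pte lock */
	ptl = pte_lockptr(mm, pmd);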

Otherwise the system may crash. Consider the following situation:

	CPU 0                            CPU 1

	zap_pte_range
	--> clear pmd entry
	    free pte page (by RCU)
	                                 retract_page_tables
	                                 --> pmd_lock
	                                     pte_lockptr(mm, pmd) <-- BOOM!!
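
To spell out the BOOM: pte_lockptr() trusts the pmd entry to find the
page table page that holds the PTE spinlock. Roughly (simplified from
include/linux/mm.h, assuming split PTE locks):

	static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
	{
		/*
		 * Once the pmd entry has been cleared and the pte page
		 * freed, pmd_page(*pmd) no longer points at a valid page
		 * table page, so this pokes at freed or junk memory.
		 */
		return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
	}

So it must only be called while the pmd entry is known to be stable.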

So calling pte_offset_map_rw_nolock(), which hands the sampled pmd
value back to the caller (the &pgt_pmd argument in the diff above) for
the later pmd_same() recheck, may be more convenient.

Thanks,
Qi

>> -		ptl = pte_lockptr(mm, pmd);
>>  		if (ptl != pml)
>>  			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>> +		if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>> +			pte_unmap_unlock(pte, ptl);
>> +			if (ptl != pml)
>> +				spin_unlock(pml);
>> +			mmu_notifier_invalidate_range_end(&range);
>> +			continue;
>> +		}
>> +		pte_unmap(pte);
>> +
>>  		/*
>>  		 * Huge page lock is still held, so normally the page table
>>  		 * must remain empty; and we have already skipped anon_vma
>>
>> --
>> 2.20.1