On Tue, 8 Jun 2021, Aneesh Kumar K.V wrote:

> mm/mremap: hold the rmap lock in write mode when moving page table entries.
>
> To avoid a race between the rmap walk and mremap, mremap does take_rmap_locks().
> The lock is taken to ensure that the rmap walk doesn't miss a page table entry
> due to PTE moves via move_page_tables(). The kernel further optimizes this
> locking: if the newly added vma will be found after the old vma during an
> rmap walk, the rmap lock is not taken. This is because the rmap walk finds
> the vmas in the same order, so if the page table entry is not found attached
> to the older vma, it will be found attached to the new vma, which is iterated
> later. The actual lifetime of the page is still controlled by the PTE lock.
>
> This patch updates the locking requirement to handle another race condition,
> explained below, with optimized mremap:
>
> Optimized PMD move
>
>     CPU 1                           CPU 2                                   CPU 3
>
>     mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one
>
>     mmap_write_lock_killable()
>
>                                     addr = old_addr
>                                     lock(pte_ptl)
>     lock(pmd_ptl)
>     pmd = *old_pmd
>     pmd_clear(old_pmd)
>     flush_tlb_range(old_addr)
>
>     *new_pmd = pmd
>                                                                             *new_addr = 10; and fills
>                                                                             TLB with new addr
>                                                                             and old pfn
>
>     unlock(pmd_ptl)
>                                     ptep_clear_flush()
>                                     old pfn is free.
>                                                                             Stale TLB entry

The PUD example below is mainly a waste of space and time:
"Optimized PUD move suffers from a similar race." would be better.

> Optimized PUD move:
>
>     CPU 1                           CPU 2                                   CPU 3
>
>     mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one
>
>     mmap_write_lock_killable()
>
>                                     addr = old_addr
>                                     lock(pte_ptl)
>     lock(pud_ptl)
>     pud = *old_pud
>     pud_clear(old_pud)
>     flush_tlb_range(old_addr)
>
>     *new_pud = pud
>                                                                             *new_addr = 10; and fills
>                                                                             TLB with new addr
>                                                                             and old pfn
>
>     unlock(pud_ptl)
>                                     ptep_clear_flush()
>                                     old pfn is free.
>                                                                             Stale TLB entry
>
> Both of the above race conditions can be fixed if we force the mremap path
> to take the rmap lock.
>

Don't forget the Fixes and Link you had in the previous version:

Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-huwbPw@xxxxxxxxxxxxxx

> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>

Thanks, this is orders of magnitude better!

Acked-by: Hugh Dickins <hughd@xxxxxxxxxx>

>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 9cd352fb9cf8..f12df630fb37 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -517,7 +517,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  		} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
>
>  			if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
> -					   old_pud, new_pud, need_rmap_locks))
> +					   old_pud, new_pud, true))
>  				continue;
>  		}
>
> @@ -544,7 +544,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  		 * moving at the PMD level if possible.
>  		 */
>  		if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
> -				   old_pmd, new_pmd, need_rmap_locks))
> +				   old_pmd, new_pmd, true))
>  			continue;
>  	}
>
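
For anyone reading along without the tree at hand, the effect of forcing
need_rmap_locks to true is easiest to see in the callee. A condensed sketch
of the mm/mremap.c helpers involved (paraphrased from memory rather than
quoted verbatim, with the HPAGE_PMD case omitted):

	static void take_rmap_locks(struct vm_area_struct *vma)
	{
		/* Serializes against rmap walkers such as try_to_unmap_one() */
		if (vma->vm_file)
			i_mmap_lock_write(vma->vm_file->f_mapping);
		if (vma->anon_vma)
			anon_vma_lock_write(vma->anon_vma);
	}

	static void drop_rmap_locks(struct vm_area_struct *vma)
	{
		if (vma->anon_vma)
			anon_vma_unlock_write(vma->anon_vma);
		if (vma->vm_file)
			i_mmap_unlock_write(vma->vm_file->f_mapping);
	}

	static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
				   unsigned long old_addr, unsigned long new_addr,
				   void *old_entry, void *new_entry,
				   bool need_rmap_locks)
	{
		bool moved;

		/*
		 * With the patch above, the PMD/PUD callers always pass true,
		 * so the whole clear-and-move is bracketed by the rmap locks.
		 */
		if (need_rmap_locks)
			take_rmap_locks(vma);

		switch (entry) {
		case NORMAL_PMD:
			moved = move_normal_pmd(vma, old_addr, new_addr,
						old_entry, new_entry);
			break;
		case NORMAL_PUD:
			moved = move_normal_pud(vma, old_addr, new_addr,
						old_entry, new_entry);
			break;
		default:
			WARN_ON_ONCE(1);
			moved = false;
		}

		if (need_rmap_locks)
			drop_rmap_locks(vma);

		return moved;
	}

Holding i_mmap_rwsem / the anon_vma lock in write mode means CPU 2's rmap
walk cannot visit this mapping concurrently with the move, so the PMD/PUD
move and its flush_tlb_range() complete first, closing the stale-TLB window
shown in the diagrams above.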