On Mon, May 24, 2021 at 10:44 PM Aneesh Kumar K.V
<aneesh.kumar@xxxxxxxxxxxxx> wrote:
>
> Should we worry about the below race. The window would be small
>
> CPU 1                         CPU 2                           CPU 3
>
> mremap(old_addr, new_addr)    page_shrinker/try_to_unmap_one
>
> mmap_write_lock_killable()
>
>                               addr = old_addr
>
> lock(pmd_ptl)
> pmd = *old_pmd
> pmd_clear(old_pmd)
> flush_tlb_range(old_addr)
>
> lock(pte_ptl)
> *new_pmd = pmd
> unlock(pte_ptl)
>
> unlock(pmd_ptl)
>                               lock(pte_ptl)
>                                                               *new_addr = 10; and fills
>                                                               TLB with new addr
>                                                               and old pfn
>
>                               ptep_clear_flush(old_addr)
>                               old pfn is free.
>                                                               Stale TLB entry

Hmm. Do you need a third CPU there? What is done above on CPU3 looks
like it might just be CPU1 accessing the new range immediately.

Which doesn't actually sound at all unlikely - so maybe the window is
small, but it sounds like something that could happen.

This looks nasty. The page shrinker has always been problematic
because it basically avoids the normal full set of locks.

I wonder if we could just make the page shrinker try-lock the mmap_sem
and avoid all this that way. It _is_ allowed to fail, after all, and
the page shrinker is "not normal" and should be less of a performance
issue than all the actual normal VM paths.

Does anybody have any good ideas?

> > And new optimization for empty pmd, which seems unrelated to the
> > change and should presumably be separate:
>
> That was added that we can safely do pte_lockptr() below

Oh, because pte_lockptr() doesn't actually use the "old_pmd" pointer
value - it actually *dereferences* the pointer.

That looks like a mis-design. Why does it do that? Why don't we pass
it the pmd value, if that's what it wants?

              Linus