Re: [PATCH v6 updated 9/11] mm/mremap: Fix race between mremap and pageout

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Tue, 25 May 2021 07:22:34 -1000

On Mon, May 24, 2021 at 10:44 PM Aneesh Kumar K.V
<aneesh.kumar@xxxxxxxxxxxxx> wrote:
>
> Should we worry about the below race? The window would be small.
>
> CPU 1                           CPU 2                                   CPU 3
>
> mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one
>
> mmap_write_lock_killable()
>
>                                 addr = old_addr
>
> lock(pmd_ptl)
> pmd = *old_pmd
> pmd_clear(old_pmd)
> flush_tlb_range(old_addr)
>
> lock(pte_ptl)
> *new_pmd = pmd
> unlock(pte_ptl)
>
> unlock(pmd_ptl)
>                                 lock(pte_ptl)
>                                                                         *new_addr = 10; and fills
>                                                                         TLB with new addr
>                                                                         and old pfn
>
>                                 ptep_clear_flush(old_addr)
>                                 old pfn is free.
>                                                                         Stale TLB entry

Hmm. Do you need a third CPU there? What is done above on CPU3 looks
like it might just be CPU1 accessing the new range immediately.

Which doesn't actually sound at all unlikely - so maybe the window is
small, but it sounds like something that could happen.

This looks nasty. The page shrinker has always been problematic
because it basically avoids the normal full set of locks.

I wonder if we could just make the page shrinker try-lock the mmap_sem
and avoid all this that way. It _is_ allowed to fail, after all, and
the page shrinker is "not normal" and should be less of a performance
issue than all the actual normal VM paths.

Does anybody have any good ideas?

> > And new optimization for empty pmd, which seems unrelated to the
> > change and should presumably be separate:
>
> That was added so that we can safely do pte_lockptr() below

Oh, because pte_lockptr() doesn't actually use the "old_pmd" pointer
value - it actually *dereferences* the pointer.

That looks like a mis-design. Why does it do that? Why don't we pass
it the pmd value, if that's what it wants?
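
The difference between the two calling conventions can be shown with a toy model. Everything below is hypothetical (toy_pmd_t, toy_lockptr_by_pointer, toy_lockptr_by_value and toy_pmd_clear are made-up names, not kernel definitions); it only assumes that the lock lives in the page table that the pmd entry points to.

```c
#include <stddef.h>

/* Toy types: a lock embedded in a pte page, and a pmd entry pointing at it. */
struct toy_lock { int id; };
struct toy_pte_page { struct toy_lock ptl; };
typedef struct { struct toy_pte_page *table; } toy_pmd_t;  /* NULL == empty */

/* Pointer variant: reads the slot at call time, so it sees a cleared entry. */
static struct toy_lock *toy_lockptr_by_pointer(toy_pmd_t *pmdp)
{
    struct toy_pte_page *table = pmdp->table;
    return table ? &table->ptl : NULL;
}

/* Value variant: works on a copy the caller took before clearing the slot. */
static struct toy_lock *toy_lockptr_by_value(toy_pmd_t pmd)
{
    return pmd.table ? &pmd.table->ptl : NULL;
}

/* Toy equivalent of clearing the pmd slot. */
static void toy_pmd_clear(toy_pmd_t *pmdp)
{
    pmdp->table = NULL;
}
```

In this model, once the slot has been cleared, the pointer variant can no longer find the lock, while the value variant still can if it is handed the saved entry. The toy returns NULL for an empty entry purely to make that visible; it says nothing about what the real helper does in that case.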

               Linus