Re: Splitting the mmap_sem

On 07/02/2020 at 09:52, Peter Zijlstra wrote:
On Thu, Feb 06, 2020 at 01:20:24PM -0800, Matthew Wilcox wrote:
On Thu, Feb 06, 2020 at 09:55:29PM +0100, Peter Zijlstra wrote:
On Thu, Feb 06, 2020 at 12:15:36PM -0800, Matthew Wilcox wrote:
then, at the beginning of a page fault call srcu_read_lock(&vma_srcu);
walk the tree as we do now, allocate memory for PTEs, sleep waiting for
pages to arrive back from disc, etc, etc, then at the end of the fault,
call srcu_read_unlock(&vma_srcu).

So far so good,...

munmap() would consist of removing the
VMA from the tree, then calling synchronize_srcu() to wait for all faults
to finish, then putting the backing file, etc, etc and freeing the VMA.
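
A minimal sketch of the two sides described above, assuming a hypothetical global vma_srcu srcu_struct and an illustrative unlink_vma() helper (neither exists in the kernel as such):

/* fault side: look up the VMA and handle the fault entirely under SRCU */
static vm_fault_t do_srcu_fault(struct mm_struct *mm, unsigned long address,
				unsigned int flags)
{
	struct vm_area_struct *vma;
	vm_fault_t ret = VM_FAULT_SIGSEGV;
	int idx;

	idx = srcu_read_lock(&vma_srcu);
	vma = find_vma(mm, address);		/* walk the tree */
	if (vma && vma->vm_start <= address)
		ret = handle_mm_fault(vma, address, flags); /* may allocate PTEs, sleep on I/O */
	srcu_read_unlock(&vma_srcu, idx);
	return ret;
}

/* munmap side: unlink first, then wait for every in-flight fault */
unlink_vma(mm, vma);			/* illustrative: remove from the tree */
synchronize_srcu(&vma_srcu);		/* wait for all SRCU readers to finish */
if (vma->vm_file)
	fput(vma->vm_file);		/* put the backing file */
vm_area_free(vma);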

call_srcu(), and the (S)RCU callback will then do the fput() and such
things.

synchronize_srcu() (like synchronize_rcu()) is stupidly slow and would
make munmap()/exit()/etc. unusable.
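
For illustration, the call_srcu() variant might look roughly like this, assuming an rcu_head embedded in the vm_area_struct (as the speculative page-fault patches added one); the names here are hypothetical, not from any posted series:

/* SRCU callback: runs after all readers of this grace period are done */
static void vma_srcu_free(struct rcu_head *head)
{
	struct vm_area_struct *vma =
		container_of(head, struct vm_area_struct, vm_rcu);

	if (vma->vm_file)
		fput(vma->vm_file);	/* put the backing file */
	vm_area_free(vma);
}

/* munmap side: unlink the VMA from the tree, then defer the teardown
 * instead of blocking in synchronize_srcu() */
call_srcu(&vma_srcu, &vma->vm_rcu, vma_srcu_free);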

I'll need to think about that a bit.  I was convinced we needed to wait
for the current pagefaults to finish before we could return from munmap().
I need to convince myself that it's OK to return to userspace while the
page faults for that range are still proceeding on other CPUs.

File I/O might be in progress; any actual faults will result in SIGSEGV
instead of installing a PTE.

It is not fundamentally different from any threaded use-after-free race.

This seems pretty reasonable, and investigation could actually proceed
before the Maple tree work lands.  Today, that would be:

idx = srcu_read_lock(&vmas_srcu);
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
up_read(&mm->mmap_sem);
... rest of fault handler path ...
srcu_read_unlock(&vmas_srcu, idx);

Kind of a pain because we still call find_vma() in the per-arch page
fault handler, but for prototyping, we'd only have to do one or two
architectures.

If you look at the earlier speculative page-fault patches by Laurent,
which were based on my still earlier patches, you'll find most of this
there.

The tricky bit was validating everything on the second page-table walk,
to see if nothing had fundamentally changed, specifically the VMA,
before installing the PTE. If you do this without mmap_sem, you need to
hold ptlock to pin stuff while validating everything you did earlier.

The patches Laurent posted used regular RCU and a per-VMA refcount, not
SRCU.

Those are his later patches, and I distinctly disagree with that
approach.

If you look at the patches here:

   https://lkml.kernel.org/r/cover.1479465699.git.ldufour@xxxxxxxxxxxxxxxxxx

you'll find it uses SRCU.

For the record, I switched from SRCU to RCU and a ref counter because, with SRCU, performance was hurt by the scheduling overhead generated to handle SRCU's asynchronous events.

I may have missed something, but using RCU and a ref counter worked around this 20% overhead.
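
Roughly, the RCU-plus-refcount lookup has this shape; the field and helper names below are illustrative, not the exact ones from the posted series:

/* pin a VMA for the duration of a speculative fault */
struct vm_area_struct *get_vma(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = find_vma(mm, addr);	/* an RCU-safe lookup in the real series */
	if (vma && !atomic_inc_not_zero(&vma->vm_ref_count))
		vma = NULL;		/* refcount hit zero: VMA is being torn down */
	rcu_read_unlock();
	return vma;			/* stays pinned until the caller does put_vma() */
}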

If you use SRCU, why would you need a second page table walk?

Because SRCU only ensures the VMA object remains extant; it does not
prevent modification of it. Normally that guarantee is provided by
mmap_sem, but we're not going to use that.

Instead, what we serialize on is the (split) ptlock. So we do the first
page-walk and ptlock to verify the vma-lookup, then we drop ptlock and
do the file-io, then we page-walk and take ptlock again, verify the vma
(again) and install the PTE. If anything goes wrong, we bail.
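
The second-walk validation might then look roughly like this; the vm_sequence seqcount check is paraphrased from the speculative page-fault patches and is not mainline API:

/* second walk: retake the (split) ptlock and revalidate before install */
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte_none(*pte) ||
    read_seqcount_retry(&vma->vm_sequence, seq)) {	/* VMA changed under us? */
	pte_unmap_unlock(pte, ptl);
	return VM_FAULT_RETRY;		/* bail; let the slow path redo it */
}
set_pte_at(mm, address, pte, entry);	/* nothing changed: install the PTE */
pte_unmap_unlock(pte, ptl);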

See this patch:

   https://lkml.kernel.org/r/301fb863785f37c319b493bd0d43167353871804.1479465699.git.ldufour@xxxxxxxxxxxxxxxxxx






