Splitting the mmap_sem

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Tue, 3 Dec 2019 14:21:47 -0800

[My thanks to Vlastimil, Michel, Liam, David, Davidlohr and Hugh for
 their feedback on an earlier version of this.  I think the solution
 we discussed doesn't quite work, so here's one which I think does.
 See the last two paragraphs in particular.]

My preferred solution to the mmap_sem scalability problem is to look up
VMAs under the RCU read lock and then take a per-VMA lock.
I've been focusing on the first half of this problem (looking up VMAs
in an RCU-safe data structure) and ignoring the second half (taking a
lock while holding the RCU lock).

We can't take a semaphore while holding the RCU lock in case we have to
sleep -- the VMA might not exist any more when we wake up.  Making the
per-VMA lock a spinlock would be a massive change -- fault handlers are
currently called with the mmap_sem held and may sleep.  There are over
100 fault handlers in the kernel, and I don't want to change the locking
in all of them.  So I think we need a per-VMA refcount.  That lets us
sleep while handling a fault.
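
To make the reader side of the refcount idea concrete, here is a rough
userspace sketch using C11 atomics rather than kernel primitives.  The
names struct vma_model, vma_tryget() and vma_put() are made up for
illustration, and the "increment unless already zero" rule is my
assumption about how a reader could avoid pinning a VMA that is already
on its way out; the RCU read lock appears only in comments.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a VMA; not the kernel's struct. */
struct vma_model {
	unsigned long start, end;	/* covers [start, end) */
	atomic_int refcount;		/* assumption: 1 = held by the tree only */
};

/*
 * Take a reference unless the count has already dropped to zero.
 * In the kernel this would run inside rcu_read_lock(), so the memory
 * behind 'vma' would still be valid even when the count is zero.
 */
static bool vma_tryget(struct vma_model *vma)
{
	int old = atomic_load(&vma->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&vma->refcount, &old, old + 1))
			return true;
	}
	return false;	/* VMA is going away; caller must redo the lookup */
}

/* Drop a reference; returns true if this was the last one. */
static bool vma_put(struct vma_model *vma)
{
	int old = atomic_fetch_sub(&vma->refcount, 1);

	assert(old > 0);	/* catch an unbalanced put */
	return old == 1;
}
```

A fault handler would call vma_tryget() under the RCU lock, drop the RCU
lock, do its (possibly sleeping) work, and finish with vma_put().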

That makes modifications to the tree a little tricky.  At the moment,
we take the rwsem for write, which waits for all readers to finish, then
we modify the VMAs, then we allow readers back in.  With RCU, there is
no way to block readers, so different threads may (at the same time)
see both an old and a new VMA for the same virtual address.

So calling mmap() looks like this:

        allocate a new VMA
        update pointer(s) in maple tree
        sleep until old VMAs have a zero refcount
        synchronize_rcu()
        free old VMAs
        flush caches for affected range
        return to userspace
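
The ordering of those steps can be modelled in a few lines of
single-threaded userspace C.  Everything here is a stand-in: struct
mm_model holds a one-entry "tree", the sleep is replaced by an assertion
that nobody holds a reference, and synchronize_rcu() and the cache flush
are just counters.  It shows the order of operations only, not real
concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types. */
struct vma_model {
	unsigned long start, end;	/* covers [start, end) */
	int refcount;			/* references held by fault handlers */
	bool freed;
};

struct mm_model {
	struct vma_model *slot;		/* one-entry stand-in for the maple tree */
	unsigned int grace_periods;	/* counts synchronize_rcu() stand-ins */
	unsigned int flushes;		/* counts cache-flush stand-ins */
};

/*
 * Install 'new_vma' (already allocated by the caller) and retire the
 * VMA it replaces, in the order listed above.  Returns the old VMA so
 * the caller can inspect it, or NULL if the slot was empty.
 */
static struct vma_model *mmap_replace(struct mm_model *mm,
				      struct vma_model *new_vma)
{
	struct vma_model *old = mm->slot;

	mm->slot = new_vma;		/* update pointer(s) in the tree */
	if (old) {
		/* Real code would sleep here until the count hits zero. */
		assert(old->refcount == 0);
		mm->grace_periods++;	/* synchronize_rcu() */
		old->freed = true;	/* free old VMA */
	}
	mm->flushes++;			/* flush caches for affected range */
	return old;			/* return to userspace */
}
```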

While one thread is calling mmap(MAP_FIXED), two other threads which are
accessing the same address may see different data from each other and
have different page translations in their respective CPU caches until
the thread calling mmap() returns.  I believe this is OK, but would
greatly appreciate hearing from people who know better.

Some people are concerned that a reference count on the VMA will lead to
contention moving from the mmap_sem to the refcount on a very large VMA
for workloads which have one giant VMA covering the entire working set.
For those workloads, I propose we use the existing ->map_pages() callback
(changed to return a vm_fault_t from the current void).

The callback will be called with the RCU lock held and no reference
count on the VMA.  If it needs to sleep, it should bump the refcount,
drop the RCU lock, prepare enough so that the next call will not need to
sleep, then drop the refcount and return VM_FAULT_RETRY so the VM knows
the VMA is no longer good and that it needs to walk the VMA tree from
the start.
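
Here is a rough single-threaded model of that contract, in userspace C.
FAULT_DONE/FAULT_RETRY, struct fault_ctx, map_pages_model() and
handle_fault_model() are all hypothetical names; the only thing taken
from the proposal is the shape of the protocol.  I'm assuming the
prepared state lives with the fault rather than the VMA, so it survives
the second tree walk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for vm_fault_t results. */
enum fault_result { FAULT_DONE = 0, FAULT_RETRY = 1 };

struct vma_model {
	int refcount;		/* references held across a sleep */
};

/* Per-fault state that survives a retry (an assumption of this sketch). */
struct fault_ctx {
	bool pte_table_ready;	/* the one thing that may need a sleep */
	int tree_walks;		/* how many times the VMA was looked up */
};

/*
 * Entered "under RCU" with no reference held.  If it would have to
 * sleep, it pins the VMA, prepares, unpins and asks for a retry.
 */
static enum fault_result map_pages_model(struct vma_model *vma,
					 struct fault_ctx *ctx)
{
	if (!ctx->pte_table_ready) {
		vma->refcount++;		/* bump the refcount */
		/* rcu_read_unlock() would go here; sleeping is now allowed */
		ctx->pte_table_ready = true;	/* prepare for the next call */
		assert(vma->refcount > 0);
		vma->refcount--;		/* drop the refcount */
		return FAULT_RETRY;		/* VMA may be stale by now */
	}
	return FAULT_DONE;			/* fast path, never slept */
}

/* Caller side: walk the tree from the start whenever RETRY comes back. */
static void handle_fault_model(struct vma_model *vma, struct fault_ctx *ctx)
{
	enum fault_result ret;

	do {
		ctx->tree_walks++;	/* rcu_read_lock() + tree lookup */
		ret = map_pages_model(vma, ctx);
	} while (ret == FAULT_RETRY);
}
```

In this model a fault that needs preparation costs two tree walks and
one that doesn't costs one, and the refcount is only touched on the slow
path.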

We currently only have one ->map_pages() callback, and it's
filemap_map_pages().  It only needs to sleep in one place -- to allocate
a PTE table.  I think that can be allocated ahead of time if needed.