[My thanks to Vlastimil, Michel, Liam, David, Davidlohr and Hugh for their feedback on an earlier version of this. I think the solution we discussed doesn't quite work, so here's one which I think does. See the last two paragraphs in particular.]

My preferred solution to the mmap_sem scalability problem is to allow VMAs to be looked up under the RCU read lock then take a per-VMA lock. I've been focusing on the first half of this problem (looking up VMAs in an RCU-safe data structure) and ignoring the second half (taking a lock while holding the RCU lock).

We can't take a semaphore while holding the RCU lock in case we have to sleep -- the VMA might not exist any more when we wake up. Making the per-VMA lock a spinlock would be a massive change -- fault handlers are currently called with the mmap_sem held and may sleep. There are over 100 fault handlers in the kernel, and I don't want to change the locking in all of them. So I think we need a per-VMA refcount. That lets us sleep while handling a fault.

That makes modifications to the tree a little tricky. At the moment, we take the rwsem for write which waits for all readers to finish, then we modify the VMAs, then we allow readers back in. With RCU, there is no way to block readers, so different threads may (at the same time) see both an old and a new VMA for the same virtual address.

So calling mmap() looks like this:

 - allocate a new VMA
 - update pointer(s) in maple tree
 - sleep until old VMAs have a zero refcount
 - synchronize_rcu()
 - free old VMAs
 - flush caches for affected range
 - return to userspace

While one thread is calling mmap(MAP_FIXED), two other threads which are accessing the same address may see different data from each other and have different page translations in their respective CPU caches until the thread calling mmap() returns. I believe this is OK, but would greatly appreciate hearing from people who know better.

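To make the refcount scheme above concrete, here is a rough userspace sketch in C11. It is not kernel code and not a proposed patch. Everything in it is a stand-in: struct vma, vma_tryget(), vma_put(), vma_lookup_get(), vma_replace() and vma_is_idle() are hypothetical names, the RCU calls are stubs, and a single pointer slot plays the role of the maple tree. It also assumes that the tree itself holds one reference on each VMA and that readers refuse to take a reference once the count has reached zero; the text above does not specify either detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vm_area_struct; not the kernel layout. */
struct vma {
	unsigned long start, end;	/* covers [start, end) */
	atomic_int refcount;		/* assumption: the tree holds one reference */
};

/* Stubs standing in for rcu_read_lock() / rcu_read_unlock(). */
static int rcu_depth;
static void rcu_read_lock_stub(void)   { rcu_depth++; }
static void rcu_read_unlock_stub(void) { rcu_depth--; }

/* A one-slot "tree": enough to show publishing a replacement VMA. */
static struct vma *_Atomic tree_slot;

static void vma_init(struct vma *vma, unsigned long start, unsigned long end)
{
	vma->start = start;
	vma->end = end;
	atomic_init(&vma->refcount, 1);	/* the reference owned by the tree */
}

/* Take a reference unless the count has already dropped to zero. */
static bool vma_tryget(struct vma *vma)
{
	int old = atomic_load(&vma->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&vma->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void vma_put(struct vma *vma)
{
	int old = atomic_fetch_sub(&vma->refcount, 1);

	assert(old > 0);
}

/* Reader: look up under the (stub) RCU lock, pin the VMA, drop the lock. */
static struct vma *vma_lookup_get(unsigned long addr)
{
	struct vma *vma;

	rcu_read_lock_stub();
	vma = atomic_load(&tree_slot);
	if (vma && (addr < vma->start || addr >= vma->end))
		vma = NULL;
	if (vma && !vma_tryget(vma))
		vma = NULL;		/* being torn down; caller would retry */
	rcu_read_unlock_stub();
	return vma;			/* caller may now sleep, then vma_put() */
}

/* Writer: publish the new VMA and drop the tree's reference on the old one. */
static struct vma *vma_replace(struct vma *new)
{
	struct vma *old = atomic_exchange(&tree_slot, new);

	if (old)
		vma_put(old);
	return old;	/* writer waits for vma_is_idle(old) before freeing */
}

static bool vma_is_idle(struct vma *vma)
{
	return atomic_load(&vma->refcount) == 0;
}
```

In this sketch a fault handler would call vma_lookup_get(), do its (possibly sleeping) work, and finish with vma_put(). The mmap() side would call vma_replace() and then wait until vma_is_idle() is true for the old VMA before the synchronize_rcu() and free steps, neither of which is modelled here.
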
Some people are concerned that a reference count on the VMA will lead to contention moving from the mmap_sem to the refcount on a very large VMA for workloads which have one giant VMA covering the entire working set. For those workloads, I propose we use the existing ->map_pages() callback (changed to return a vm_fault_t from the current void).

It will be called with the RCU lock held and no reference count on the VMA. If it needs to sleep, it should bump the refcount, drop the RCU lock, prepare enough so that the next call will not need to sleep, then drop the refcount and return VM_FAULT_RETRY so the VM knows the VMA is no longer good, and it needs to walk the VMA tree from the start.

We currently only have one ->map_pages() callback, and it's filemap_map_pages(). It only needs to sleep in one place -- to allocate a PTE table. I think that can be allocated ahead of time if needed.
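
The retry protocol for ->map_pages() might look roughly like the following single-threaded userspace sketch. Again this is illustration only. The vm_fault_t typedef and the VM_FAULT_* values are local stand-ins rather than the kernel definitions, and struct fault_ctx, map_pages_sketch(), handle_fault_sketch() and alloc_pte_table() are hypothetical names. It assumes that a non-zero return means the callback has already dropped the RCU lock, and that the preallocated PTE table is carried in a per-fault context across the retry; neither is spelled out above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int vm_fault_t;	/* local stand-in for the kernel type */
#define VM_FAULT_RETRY	0x1u		/* stand-in values, not the kernel's */
#define VM_FAULT_OOM	0x2u

struct vma {
	int refcount;		/* plain int: this sketch is single-threaded */
};

/* Hypothetical per-fault state that survives across retries. */
struct fault_ctx {
	void *prealloc_pte;	/* table allocated ahead of time */
	void *installed_pte;	/* table actually in use */
};

/* Stubs standing in for rcu_read_lock() / rcu_read_unlock(). */
static int rcu_depth;
static void rcu_read_lock_stub(void)   { rcu_depth++; }
static void rcu_read_unlock_stub(void) { rcu_depth--; }

/* Stands in for a sleeping allocation: must not run under the RCU lock. */
static void *alloc_pte_table(void)
{
	assert(rcu_depth == 0);
	return calloc(512, sizeof(unsigned long));
}

/*
 * Called with the RCU lock held and no reference on the VMA.
 * Assumption: returns 0 with the RCU lock still held, or non-zero
 * with the RCU lock already dropped.
 */
static vm_fault_t map_pages_sketch(struct vma *vma, struct fault_ctx *ctx)
{
	if (!ctx->installed_pte && !ctx->prealloc_pte) {
		vma->refcount++;	/* pin the VMA before we may sleep */
		rcu_read_unlock_stub();
		ctx->prealloc_pte = alloc_pte_table();
		vma->refcount--;	/* the VMA may be stale from here on */
		return ctx->prealloc_pte ? VM_FAULT_RETRY : VM_FAULT_OOM;
	}
	if (!ctx->installed_pte) {	/* non-sleeping path */
		ctx->installed_pte = ctx->prealloc_pte;
		ctx->prealloc_pte = NULL;
	}
	return 0;
}

/* Caller: restart from the top whenever the callback asks for a retry. */
static vm_fault_t handle_fault_sketch(struct vma *vma, struct fault_ctx *ctx,
				      int *attempts)
{
	vm_fault_t ret;

	do {
		rcu_read_lock_stub();
		/* A real implementation would re-walk the VMA tree here. */
		(*attempts)++;
		ret = map_pages_sketch(vma, ctx);
	} while (ret == VM_FAULT_RETRY);

	if (ret == 0)
		rcu_read_unlock_stub();
	return ret;
}
```

The point of the shape is that the refcount is only touched on the slow path. A fault that finds everything it needs never writes to the VMA, which is what should keep one giant VMA from becoming the new contention point.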