Re: Splitting the mmap_sem

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Thu, 12 Dec 2019 07:40:02 -0800

On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote:
> On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > My preferred solution to the mmap_sem scalability problem is to allow
> > VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> > I've been focusing on the first half of this problem (looking up VMAs
> > in an RCU-safe data structure) and ignoring the second half (taking a
> > lock while holding the RCU lock).
> 
> Do you expect this approach to be regression-free for the uncontended case?
> I doubt it will not cause regressions for single-threaded applications...

Which part of the approach do you think will cause a regression?  The
maple tree is quicker to traverse than the rbtree (in our simulations).
Incrementing a refcount on a VMA is surely no slower than acquiring an
uncontended rwsem for read.  mmap() and munmap() will get slower, but is
that a problem?
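
For concreteness, here is a rough userspace sketch of the kind of
"take a reference unless it has already dropped to zero" increment being
compared against the rwsem read fast path.  It is not kernel code: it uses
C11 atomics rather than refcount_t, and struct vma_sketch, vma_tryget()
and vma_put() are made-up names for illustration only.

```c
/* Userspace sketch only: C11 atomics stand in for the kernel's refcount_t. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VMA; not the kernel's struct vm_area_struct. */
struct vma_sketch {
	atomic_int refcount;	/* 0 means the VMA is being torn down */
};

/*
 * Take a reference unless the count has already reached zero.
 * Uncontended, this is one load plus one compare-and-swap.
 */
static bool vma_tryget(struct vma_sketch *vma)
{
	int old = atomic_load_explicit(&vma->refcount, memory_order_relaxed);

	while (old != 0) {
		if (atomic_compare_exchange_weak_explicit(&vma->refcount,
				&old, old + 1,
				memory_order_acquire, memory_order_relaxed))
			return true;
		/* A failed CAS reloaded 'old'; loop and try again. */
	}
	return false;	/* VMA is going away; caller needs a fallback */
}

/* Drop a reference; returns true if this was the last one. */
static bool vma_put(struct vma_sketch *vma)
{
	int old = atomic_fetch_sub_explicit(&vma->refcount, 1,
					    memory_order_release);

	assert(old > 0);	/* dropping a reference we never held is a bug */
	return old == 1;
}
```

In the scheme described above, something like vma_tryget() would run after
the RCU-protected lookup, and a false return would presumably send the
caller to a slow path; what that fallback looks like is not settled here.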

> > We currently only have one ->map_pages() callback, and it's
> > filemap_map_pages().  It only needs to sleep in one place -- to allocate
> > a PTE table.  I think that can be allocated ahead of time if needed.
> 
> No, filemap_map_pages() doesn't sleep. It cannot. The whole body of the
> function is under rcu_read_lock(). It uses a pre-allocated page table.
> See do_fault_around().

Oh, thank you!  That makes the ->map_pages() optimisation already workable
with no changes.