Re: Splitting the mmap_sem

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Thu, 5 Dec 2019 21:13:22 -0800

On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> Adding few interested people in cc

I figured they all read linux-mm already ;-)

> On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > accessing the same address may see different data from each other and
> > have different page translations in their respective CPU caches until
> > the thread calling mmap() returns.  I believe this is OK, but would
> > greatly appreciate hearing from people who know better.
> 
> I do not believe this is OK, i believe this is wrong (not even considering
> possible hardware issues that can arise from such aliasing).

Well, OK, but why do you believe it is wrong?  If thread A is executing
a load instruction at the same time that thread B is calling mmap(),
it really is indeterminate what value A loads.  It might be from before
the call to mmap() and it might be from after.  And if thread C is also
executing a load instruction at the same time, then it might already get
a different result from thread A.  And can threads A and C really tell
which of them executed the load instruction 'first'?  I think this is
all so indeterminate already that the (lack of) guarantees I outlined
above are acceptable.

But we should all agree on this, so _please_ continue to argue your case
for why you believe it to be wrong.

[snip proposed solution -- if the problem needs solving, we can argue
about how to solve it later]

> > Some people are concerned that a reference count on the VMA will lead to
> > contention moving from the mmap_sem to the refcount on a very large VMA
> > for workloads which have one giant VMA covering the entire working set.
> > For those workloads, I propose we use the existing ->map_pages() callback
> > (changed to return a vm_fault_t from the current void).
> > 
> > It will be called with the RCU lock held and no reference count on
> > the vma.  If it needs to sleep, it should bump the refcount, drop the
> > RCU lock, prepare enough so that the next call will not need to sleep,
> > then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> > VMA is no longer good, and it needs to walk the VMA tree from the start.
> 
> Just to make sure i understand, you propose that ->map_pages() becomes
> a new ->fault() handler that get calls before ->fault() without refcount
> so that we can update fs/drivers slowly to perform better in the new scheme
> (ie avoid the overead of refcounting if possible at all) ?
> 
> The ->fault() callback would then be the "slow" path which will require
> a refcount on the vma (taken by core mm code before dropping rcu lock).

I would actually propose never updating most drivers.  There's just no
need for them to handle such an unstable and tricky situation as this.
Let's not make driver writers lives harder.

For the ones which need this kind of scalability (and let's be clear, they
would already have *better* scalability than today due to the rwsem being
split into a per-VMA refcount), then yes, implementing ->map_pages would
be the way to go.  Indeed, they would probably benefit from implementing
it today, since it will reduce the number of page faults.