Re: Splitting the mmap_sem

On Sun, Dec 08, 2019 at 07:33:35PM -0800, Matthew Wilcox wrote:
> On Fri, Dec 06, 2019 at 12:30:30PM -0500, Jerome Glisse wrote:
> > On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> > > On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > > > > accessing the same address may see different data from each other and
> > > > > have different page translations in their respective CPU caches until
> > > > > the thread calling mmap() returns.  I believe this is OK, but would
> > > > > greatly appreciate hearing from people who know better.
> > > > 
> > > > I do not believe this is OK, I believe this is wrong (not even considering
> > > > possible hardware issues that can arise from such aliasing).
> > > 
> > > Well, OK, but why do you believe it is wrong?  If thread A is executing
> > > a load instruction at the same time that thread B is calling mmap(),
> > > it really is indeterminate what value A loads.  It might be from before
> > > the call to mmap() and it might be from after.  And if thread C is also
> > > executing a load instruction at the same time, then it might already get
> > > a different result from thread A.  And can threads A and C really tell
> > > which of them executed the load instruction 'first'?  I think this is
> > > all so indeterminate already that the (lack of) guarantees I outlined
> > > above are acceptable.
> > > 
> > > But we should all agree on this, so _please_ continue to argue your case
> > > for why you believe it to be wrong.
> > 
> > I agree that such an application might look like it is doing something
> > indeterminate, but there might be applications that catch SIGSEGV and use
> > it as synchronization. I did something similar for reverse engineering a
> > long time ago with a library called libsegfault ...
> > 
> > In any case, I agree that an application that is not catching SIGSEGV, and
> > which is doing the above (access patterns), is doing something indeterminate.
> > 
> > Nonetheless I believe it is important that, at any point in time, for all
> > the threads in a given process, on all the CPUs, a given virtual address
> > should always point to the same physical memory (or to nothing), ie we
> > should never have one CPU that sees different physical memory from another
> > CPU for the same virtual address.
> > 
> > Well, I feel like you are also not discussing munmap(); the above seemed
> > to be about MAP_FIXED (replacing an existing mapping). For munmap() too I
> > believe we should agree on what the expected behavior should be, and from
> > my POV again we should not allow a new mapping to appear until a "running"
> > munmap is fully done (ie all CPU caches and TLBs are flushed). For the
> > same reason as above, ie all CPUs always see the same physical memory (or
> > nothing) for a given virtual address.
> 
> I see MAP_FIXED as being the harder case, but sure, let's talk about
> munmap!  I agree that a munmap() + mmap() call should not permit thread
> B to see the old value after thread A has seen the new value.  But,
> as long as no new mmap can occupy that range, then it's OK if thread A
> takes a segfault while thread B can still load the old value.  At least
> for a short window.
> 
> We can replicate that behaviour by ensuring that new lookups see a NULL
> entry, but new attempts to allocate will not reuse that range until the
> munmap has finished and all TLB entries are flushed.  The maple tree
> actually supports a "ZERO" entry (just like the XArray does) which has
> this behaviour -- lookups see NULL, but attempts to allocate do not see
> it as free.  We already use that property to prevent allocating above
> the end of the process address space.
> 
> > This is what we have today with the big rwsem, and I think we need to
> > keep that behavior even with concurrency. I do not believe this will
> > impact performance, and it is easy enough to solve, so I feel safer doing
> > so given it does not cost anything.
> > 
> > So I would rather argue about why we should change the current behavior
> > if we can fix the concurrency without changing it (hence why discussing
> > solutions might also be relevant here).
> 
> It seems like you want to force a thread which sees an ongoing munmap
> to spin or sleep until the munmap is done, rather than immediately take
> a segfault, and I don't know that's a useful behaviour.

No, for munmap() I want today's behavior modulo concurrency, ie:
    - any concurrent fault while munmap() is ongoing should segfault
    - any new mmap() may install itself in the unmapped area, but cannot
      service faults until the munmap() is fully done (ie all caches and
      TLBs are flushed)

So it seems we agree on that, given your presentation of the maple tree
ZERO entry above; a sketch of those semantics is below.
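
To make that concrete, a minimal kernel-side sketch of those semantics,
using the existing XArray reserved-entry API (the maple tree equivalent
would be analogous; error handling mostly elided):

#include <linux/xarray.h>

static DEFINE_XARRAY_ALLOC(ranges);
static int dummy;

static void zero_entry_demo(void)
{
        u32 id;

        /* Reserve index 5: stored internally as a "zero" entry. */
        if (xa_reserve(&ranges, 5, GFP_KERNEL))
                return;

        /* Readers see NULL, just as a racing fault would find no VMA. */
        WARN_ON(xa_load(&ranges, 5) != NULL);

        /*
         * But the allocator does not consider the slot free, so a new
         * "mapping" cannot claim index 5 until the reservation is
         * dropped (analogous to: until the munmap TLB flush is done).
         */
        if (!xa_alloc(&ranges, &id, &dummy, XA_LIMIT(0, 9), GFP_KERNEL))
                WARN_ON(id == 5);

        /* Once the "munmap" completes, release the range for reuse. */
        xa_release(&ranges, 5);
}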

I want the spin only for a fault in a MAP_FIXED range that replaces an
existing range, which is what we have today: if you do an mmap(MAP_FIXED)
today, you will block concurrent faults until the rwsem is released. I argue
we should keep that behavior, as any program that keeps accessing such an
area while the mmap() is in progress is doing something peculiar (it can be
a bug, or it can be something totally intentional used in conjunction with
signal handlers).

This will only affect such applications, and it will keep TLB coherency
across all CPUs. The implementation cost can be kept very low, and it does
not impede concurrency on other virtual ranges.

Keeping the existing behavior just feels a lot safer to me, and it avoids
us pondering whether or not a CPU can freak out if it sees different TLB
entries for the same virtual address. A sketch of the fault side of what I
have in mind follows.
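
Concretely, the fault side could be shaped like this (vma_in_fixed_replace()
and vma_replace_wq are hypothetical names, purely to show the intended
semantics; a sketch, not a patch):

/*
 * Sketch only: how a fault would interact with a concurrent munmap()
 * vs a concurrent mmap(MAP_FIXED) replacing an existing mapping.
 */
static vm_fault_t fault_in_range(struct vm_area_struct *vma,
                                 struct vm_fault *vmf)
{
        /*
         * Racing with munmap(): the lookup saw the ZERO entry, found
         * no VMA, and the fault raises SIGSEGV, as today.
         */
        if (!vma)
                return VM_FAULT_SIGSEGV;

        /*
         * Racing with mmap(MAP_FIXED) over an existing mapping: wait
         * until the new mapping is fully installed and stale TLB
         * entries are flushed on all CPUs, so no thread ever sees two
         * different translations for the same virtual address.
         */
        if (vma_in_fixed_replace(vma))          /* hypothetical */
                wait_event(vma_replace_wq,      /* hypothetical */
                           !vma_in_fixed_replace(vma));

        return vma->vm_ops->fault(vmf);
}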


> > > > Just to make sure I understand: you propose that ->map_pages() becomes
> > > > a new ->fault() handler that gets called before ->fault() without a
> > > > refcount, so that we can update fs/drivers slowly to perform better in
> > > > the new scheme (ie avoid the overhead of refcounting if possible at all)?
> > > > 
> > > > The ->fault() callback would then be the "slow" path, which will require
> > > > a refcount on the vma (taken by core mm code before dropping the rcu lock).
> > > 
> > > I would actually propose never updating most drivers.  There's just no
> > > need for them to handle such an unstable and tricky situation as this.
> > > Let's not make driver writers' lives harder.
> > > 
> > > For the ones which need this kind of scalability (and let's be clear, they
> > > would already have *better* scalability than today due to the rwsem being
> > > split into a per-VMA refcount), then yes, implementing ->map_pages would
> > > be the way to go.  Indeed, they would probably benefit from implementing
> > > it today, since it will reduce the number of page faults.
> > 
> > Yes, they will get better scalability, but I can see some of those drivers
> > wanting the extra few mini-percent :) In any case, I believe it might be
> > better to give it a new name, ie fix the current map_pages() users and
> > rename the callback to something more explicit (atomic_map_pages() or
> > something similar; I am not good at naming). But otherwise this looks like
> > a good plan to avoid excessive refcount overhead.
> 
> OK, great.  I don't think the current name is bad, but if someone comes
> up with a better one, I don't have a problem with renaming it.

I just like it when new behavior is reflected in the name; what about
rcu_map_pages? That would make it clear it is called under RCU. For
reference, the dispatch I understand us to be agreeing on is sketched below.
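
The dispatch would look roughly like this (vma_tryget()/vma_put() are
made-up names for the proposed per-VMA refcount; only the shape matters):

static vm_fault_t do_fault_rcu(struct mm_struct *mm, unsigned long addr,
                               struct vm_fault *vmf)
{
        struct vm_area_struct *vma;
        vm_fault_t ret;

        rcu_read_lock();
        vma = find_vma(mm, addr);  /* RCU-safe lookup in the new scheme */
        if (vma && vma->vm_ops->map_pages) {
                /*
                 * Fast path: runs under RCU only, no refcount taken.
                 * The handler must tolerate the VMA being concurrently
                 * torn down -- hence the rcu_map_pages name.
                 */
                ret = vma->vm_ops->map_pages(vmf, vmf->pgoff, vmf->pgoff);
                rcu_read_unlock();
                return ret;
        }

        /* Slow path: pin the VMA before leaving the RCU section. */
        if (!vma || !vma_tryget(vma)) {         /* hypothetical */
                rcu_read_unlock();
                return VM_FAULT_SIGSEGV;
        }
        rcu_read_unlock();

        ret = vma->vm_ops->fault(vmf);  /* may sleep; VMA is pinned */
        vma_put(vma);                   /* hypothetical */
        return ret;
}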

Cheers,
Jérôme