On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > Adding a few interested people in cc
> 
> I figured they all read linux-mm already ;-)
> 
> > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > > accessing the same address may see different data from each other and
> > > have different page translations in their respective CPU caches until
> > > the thread calling mmap() returns.  I believe this is OK, but would
> > > greatly appreciate hearing from people who know better.
> > 
> > I do not believe this is OK, I believe this is wrong (not even considering
> > possible hardware issues that can arise from such aliasing).
> 
> Well, OK, but why do you believe it is wrong?  If thread A is executing
> a load instruction at the same time that thread B is calling mmap(),
> it really is indeterminate what value A loads.  It might be from before
> the call to mmap() and it might be from after.  And if thread C is also
> executing a load instruction at the same time, then it might already get
> a different result from thread A.  And can threads A and C really tell
> which of them executed the load instruction 'first'?  I think this is
> all so indeterminate already that the (lack of) guarantees I outlined
> above are acceptable.
> 
> But we should all agree on this, so _please_ continue to argue your case
> for why you believe it to be wrong.

I agree that such an application might look like it is doing something
indeterminate, but there might be applications that catch SEGFAULT and
use it as synchronization. I did something similar for reverse engineering
a long time ago with a library called libsegfault ... In any case, I agree
that an application which is not catching SEGFAULT, and which is doing the
above access patterns, is doing something indeterminate.
Nonetheless, I believe it is important that, at any point in time, for all
the threads in a given process, on all CPUs, a given virtual address should
always point to the same physical memory (or to nothing), ie we should
never have one CPU that sees different physical memory from another CPU
for the same virtual address.

> [snip proposed solution -- if the problem needs solving, we can argue
> about how to solve it later]

Well, I feel like you are also not discussing munmap(); the above seemed
to be about MAP_FIXED (replacing an existing mapping). For munmap() too, I
believe we should agree on what the expected behavior should be, and from
my POV, again, we should not allow a new mapping to appear until a
"running" munmap() is fully done (ie all CPU caches and TLBs flushed), for
the same reason as above, ie all CPUs always see the same physical memory
(or nothing) for a given virtual address. This is what we have today with
the big rwsem, and I think we need to keep that behavior even with
concurrency. I do not believe this will impact performance, and it is easy
enough to solve, so I feel safer doing so given it does not cost anything.
So I would rather argue about why we should change the current behavior if
we can fix the concurrency without changing it (hence why discussing
solutions might also be relevant here).

> > > Some people are concerned that a reference count on the VMA will lead to
> > > contention moving from the mmap_sem to the refcount on a very large VMA
> > > for workloads which have one giant VMA covering the entire working set.
> > > For those workloads, I propose we use the existing ->map_pages() callback
> > > (changed to return a vm_fault_t from the current void).
> > >
> > > It will be called with the RCU lock held and no reference count on
> > > the vma.
> > > If it needs to sleep, it should bump the refcount, drop the
> > > RCU lock, prepare enough so that the next call will not need to sleep,
> > > then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> > > VMA is no longer good, and it needs to walk the VMA tree from the start.
> > 
> > Just to make sure I understand, you propose that ->map_pages() becomes
> > a new ->fault() handler that gets called before ->fault() without a
> > refcount, so that we can update fs/drivers slowly to perform better in
> > the new scheme (ie avoid the overhead of refcounting if possible at all)?
> > 
> > The ->fault() callback would then be the "slow" path which will require
> > a refcount on the vma (taken by core mm code before dropping the rcu lock).
> 
> I would actually propose never updating most drivers.  There's just no
> need for them to handle such an unstable and tricky situation as this.
> Let's not make driver writers' lives harder.
> 
> For the ones which need this kind of scalability (and let's be clear, they
> would already have *better* scalability than today due to the rwsem being
> split into a per-VMA refcount), then yes, implementing ->map_pages would
> be the way to go.  Indeed, they would probably benefit from implementing
> it today, since it will reduce the number of page faults.

Yes, they will get better scalability, but I expect some of those drivers
might want the extra few mini-percent :) In any case, I believe it might
be better to give it a new name, ie fix the current map_pages() users and
rename that callback to something more explicit (atomic_map_pages() or
something similar; I am not good at naming). But otherwise this looks like
a good plan to avoid excessive refcount overhead.

Cheers,
Jérôme
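P.S. For concreteness, here is roughly how I picture a driver's callback
under the proposed protocol. Pseudocode only: the mydrv_* helpers and
vma_refcount_get()/vma_refcount_put() are hypothetical names standing in
for whatever the per-VMA refcount API ends up being.

```c
/* Pseudocode sketch.  Called under rcu_read_lock() with no refcount
 * held on the VMA; all mydrv_* and vma_refcount_* names are made up. */
static vm_fault_t mydrv_map_pages(struct vm_fault *vmf,
				  pgoff_t start_pgoff, pgoff_t end_pgoff)
{
	struct mydrv_object *obj = vmf->vma->vm_private_data;

	/* Fast path: everything is ready, map without sleeping. */
	if (mydrv_pages_resident(obj, start_pgoff, end_pgoff))
		return mydrv_insert_pfns(vmf, start_pgoff, end_pgoff);

	/* Slow path: we would need to sleep.  Pin the VMA, leave the
	 * RCU section, prepare so the next call can take the fast path,
	 * then tell the VM to re-walk the VMA tree from the start. */
	if (!vma_refcount_get(vmf->vma))
		return VM_FAULT_RETRY;	/* VMA is already going away */
	rcu_read_unlock();
	mydrv_make_pages_resident(obj, start_pgoff, end_pgoff); /* may sleep */
	rcu_read_lock();
	vma_refcount_put(vmf->vma);
	return VM_FAULT_RETRY;
}
```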