+cc vbabka

I realise this is resurrecting an old thread, but it's helpful to cc us mmap
maintainers, as my mail at least is a nightmare :P

Thanks!

On Thu, Jan 23, 2025 at 05:14:27PM +1300, Barry Song wrote:
> > All userfaultfd operations, except write-protect, opportunistically use
> > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > critical section.
> >
> > Write-protect operation requires mmap_lock as it iterates over multiple
> > vmas.
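(Editorial aside, for readers following along: roughly, the opportunistic
pattern described above looks like the sketch below. This is not the patch
itself - uffd_op_sketch() and do_uffd_op() are placeholders - but it uses the
existing lock_vma_under_rcu()/vma_end_read(), vma_lookup() and mmap_lock
helpers to show the "try per-VMA lock first, fall back to mmap_lock" idea:)

	/* Sketch only: do_uffd_op() stands in for the actual uffd operation. */
	static void uffd_op_sketch(struct mm_struct *mm, unsigned long addr)
	{
		struct vm_area_struct *vma;

		/* Fast path: per-VMA read lock only, mmap_lock is not touched. */
		vma = lock_vma_under_rcu(mm, addr);
		if (vma) {
			do_uffd_op(vma);
			vma_end_read(vma);	/* drop the per-VMA lock */
			return;
		}

		/* Slow path: fall back to the mmap_lock read side and retry. */
		mmap_read_lock(mm);
		vma = vma_lookup(mm, addr);
		if (vma)
			do_uffd_op(vma);
		mmap_read_unlock(mm);
	}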
>
> Hi Lokesh,
>
> Apologies for reviving this old thread. We truly appreciate the excellent work
> you’ve done in transitioning many userfaultfd operations to per-VMA locks.

Let me also say - thanks Lokesh!

>
> However, we’ve noticed that userfaultfd still remains one of the largest users
> of mmap_lock for write operations, with the other—binder—having been recently
> addressed by Carlos Llamas's "binder: faster page installations" series:
>
> https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@xxxxxxxxxx/
>
> The HeapTaskDaemon(Java GC) might frequently perform userfaultfd_register()
> and userfaultfd_unregister() operations, both of which require the mmap_lock
> in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> lower-priority background task, there are cases where, after acquiring the
> mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> threads waiting for the mmap_lock — whether in writer or reader mode—can
> end up experiencing significant delays(The delay can reach several hundred
> milliseconds in the worst case.)

Thanks for reporting - this strikes me as an important report, but I'm not
sure about your proposed solution :)

However, I wonder if we can't look at handling the locking around uffd
register/unregister more efficiently?

I think uffd suffers from a complexity problem, in that there are multiple
moving parts and complicated interactions, especially for each of the
different kinds of events uffd can deal with. Also, ordering matters a great
deal within uffd.

Ideally we'd not have uffd effectively have open-coded hooks all over the mm
code, but that ship sailed long ago, and so we need to really carefully
assess how locking changes impact uffd behaviour.

>
> We haven’t yet identified an ideal solution for this. However, the Java heap
> appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> idea would be to designate a specific region of the user address space as
> "volatile" and restrict all "volatile" VMAs to this isolated region.
>
> We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will be
> mapped to the volatile space, while those without it will be mapped to the
> non-volatile space.

This feels like a major thing to do just to suit specific uffd usages, a
feature that not everybody makes use of.

The virtual address space is a somewhat sensitive subject (see the
relatively recent discussion on a proposed MAP_BELOW flag), and has a lot of
landmines you can step on.

How would this range be determined? How would this interact with people
doing 'interesting' things with MAP_FIXED mappings? What if somebody
MAP_FIXEDs into a volatile region and gets this behaviour by mistake?

How do the two locks interact with one another and with vma locks? I mean,
we already have a very complicated set of interactions between the mmap sem
and vma locks, where the mmap sem is taken to lock the mmap (of course), but
now we'd have two?

Also, you have to write-lock when you modify the VMA tree - would we have
two maple trees now? And what about the overhead?

I feel like this is not the right route for this.

>
> ┌────────────┐TASK_SIZE
> │            │
> │            │
> │            │mmap VOLATILE
> ┼────────────┤
> │            │
> │            │
> │            │
> │            │
> │            │default mmap
> │            │
> │            │
> └────────────┘
>
> VMAs in the volatile region are assigned their own volatile_mmap_lock,
> which is independent of the mmap_lock for the non-volatile region.
> Additionally, we ensure that no single VMA spans the boundary between
> the volatile and non-volatile regions. This separation prevents the
> frequent modifications of a small number of volatile VMAs from blocking
> other operations on a large number of non-volatile VMAs.

I think really, overall, this will be solving one can of worms by
introducing another can of very large worms in space :P but perhaps I am
missing details here.

>
> The implementation itself wouldn’t be overly complex, but the design
> might come across as somewhat hacky.
>
> Lastly, I have two questions:
>
> 1. Have you observed similar issues where userfaultfd continues to
> cause lock contention and priority inversion?
>
> 2. If so, do you have any ideas or suggestions on how to address this
> problem?

Addressing and investigating this, however, is VERY IMPORTANT. Let's perhaps
investigate how we might find a uffd-specific means of addressing these
problems.

> >
> > Signed-off-by: Lokesh Gidra <lokeshgidra@xxxxxxxxxx>
> > Reviewed-by: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
> > ---
> >  fs/userfaultfd.c              |  13 +-
> >  include/linux/userfaultfd_k.h |   5 +-
> >  mm/huge_memory.c              |   5 +-
> >  mm/userfaultfd.c              | 380 ++++++++++++++++++++++++++--------
> >  4 files changed, 299 insertions(+), 104 deletions(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index c00a021bcce4..60dcfafdc11a 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
>
> Thanks
> Barry
>

Cheers, Lorenzo