On Thu, Jan 23, 2025 at 04:48:10PM +0000, Lorenzo Stoakes wrote:
> +cc vbabka

Sorry, my mail client messed up and I forgot to actually cc... fixed :)

>
> I realise this is resurrecting an old thread, but it's helpful to cc us mmap
> maintainers as my mail at least is a nightmare :P Thanks!
>
> On Thu, Jan 23, 2025 at 05:14:27PM +1300, Barry Song wrote:
> > > All userfaultfd operations, except write-protect, opportunistically use
> > > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > > critical section.
> > >
> > > Write-protect operation requires mmap_lock as it iterates over multiple
> > > vmas.
> >
> > Hi Lokesh,
> >
> > Apologies for reviving this old thread. We truly appreciate the excellent work
> > you've done in transitioning many userfaultfd operations to per-VMA locks.
>
> Let me also say - thanks Lokesh!
>
> > However, we've noticed that userfaultfd still remains one of the largest users
> > of mmap_lock for write operations, with the other - binder - having been recently
> > addressed by Carlos Llamas's "binder: faster page installations" series:
> >
> > https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@xxxxxxxxxx/
> >
> > The HeapTaskDaemon (Java GC) might frequently perform userfaultfd_register()
> > and userfaultfd_unregister() operations, both of which require the mmap_lock
> > in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> > lower-priority background task, there are cases where, after acquiring the
> > mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> > threads waiting for the mmap_lock - whether in writer or reader mode - can
> > end up experiencing significant delays (the delay can reach several hundred
> > milliseconds in the worst case).
>
> Thanks for reporting - this strikes me as an important report, but I'm not
> sure about your proposed solution :)
>
> However, I wonder if we can't look at handling the locking around uffd
> register/unregister more efficiently?
>
> I think uffd suffers from a complexity problem, in that there are multiple
> moving parts and complicated interactions, especially for each of the
> different kinds of events uffd can deal with.
>
> Also, ordering matters a great deal within uffd. Ideally we'd not have uffd
> effectively have open-coded hooks all over the mm code, but that ship
> sailed long ago, and so we need to really carefully assess how locking
> changes impact uffd behaviour.
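
For concreteness, the register/unregister churn Barry describes boils down to
roughly the following userspace pattern. This is only a minimal sketch - the
64 MiB heap size, the MISSING-mode registration and the whole-range register
are illustrative assumptions, not anything taken from the Android
HeapTaskDaemon code - but it shows where the mmap_lock write acquisitions
come from:

/* Minimal sketch of a GC-style userfaultfd register/unregister cycle.
 * Sizes and mode flags are illustrative assumptions only.
 */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64UL << 20;	/* pretend 64 MiB heap region */
	void *heap;
	int uffd;

	uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0) { perror("userfaultfd"); return 1; }

	struct uffdio_api api = { .api = UFFD_API };
	if (ioctl(uffd, UFFDIO_API, &api)) { perror("UFFDIO_API"); return 1; }

	heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED) { perror("mmap"); return 1; }

	/* Register the heap range for missing-page faults. */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)heap, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) { perror("UFFDIO_REGISTER"); return 1; }

	/* ... GC pass resolves faults, e.g. via UFFDIO_COPY ... */

	/* Drop the registration again once the pass is done. */
	struct uffdio_range rng = { .start = (unsigned long)heap, .len = len };
	if (ioctl(uffd, UFFDIO_UNREGISTER, &rng)) { perror("UFFDIO_UNREGISTER"); return 1; }

	munmap(heap, len);
	close(uffd);
	return 0;
}

Each UFFDIO_REGISTER/UFFDIO_UNREGISTER call takes mmap_lock for write on the
kernel side - updating VMA flags and splitting or merging VMAs when the range
covers only part of a mapping - which is exactly the write-lock traffic Barry
is reporting.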

> > We haven't yet identified an ideal solution for this. However, the Java heap
> > appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> > idea would be to designate a specific region of the user address space as
> > "volatile" and restrict all "volatile" VMAs to this isolated region.
> >
> > We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will be
> > mapped to the volatile space, while those without it will be mapped to the
> > non-volatile space.
>
> This feels like a major thing to do just to suit specific uffd usages, a
> feature that not everybody makes use of.
>
> The virtual address space is a somewhat sensitive subject (see the
> relatively recent discussion on a proposed MAP_BELOW flag), and has a lot
> of landmines you can step on.
>
> How would this range be determined? How would this interact with people
> doing 'interesting' things with MAP_FIXED mappings? What if somebody maps
> MAP_FIXED into a volatile region and gets this behaviour by mistake?
>
> How do the two locks interact with one another and vma locks? I mean we
> already have a very complicated set of interactions between the mmap sem
> and vma locks where the mmap sem is taken to lock the mmap (of course), but
> now we'd have two?
>
> Also, you have to write-lock when you modify the VMA tree - would we have two
> maple trees now? And what about the overhead?
>
> I feel like this is not the right route for this.
>
> >
> > ┌────────────┐TASK_SIZE
> > │            │
> > │            │
> > │            │mmap VOLATILE
> > ┼────────────┤
> > │            │
> > │            │
> > │            │
> > │            │
> > │            │default mmap
> > │            │
> > │            │
> > └────────────┘
> >
> > VMAs in the volatile region are assigned their own volatile_mmap_lock,
> > which is independent of the mmap_lock for the non-volatile region.
> > Additionally, we ensure that no single VMA spans the boundary between
> > the volatile and non-volatile regions. This separation prevents the
> > frequent modifications of a small number of volatile VMAs from blocking
> > other operations on a large number of non-volatile VMAs.
>
> I think really overall this will be solving one can of worms by introducing
> another can of very large worms in space :P but perhaps I am missing
> details here.
>
> > The implementation itself wouldn't be overly complex, but the design
> > might come across as somewhat hacky.
> >
> > Lastly, I have two questions:
> >
> > 1. Have you observed similar issues where userfaultfd continues to
> > cause lock contention and priority inversion?
> >
> > 2. If so, do you have any ideas or suggestions on how to address this
> > problem?
>
> Addressing and investigating this, however, is VERY IMPORTANT. Let's perhaps
> investigate how we might find a uffd-specific means of addressing these
> problems.
>
> > > Signed-off-by: Lokesh Gidra <lokeshgidra@xxxxxxxxxx>
> > > Reviewed-by: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
> > > ---
> > >  fs/userfaultfd.c              |  13 +-
> > >  include/linux/userfaultfd_k.h |   5 +-
> > >  mm/huge_memory.c              |   5 +-
> > >  mm/userfaultfd.c              | 380 ++++++++++++++++++++++++++--------
> > >  4 files changed, 299 insertions(+), 104 deletions(-)
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index c00a021bcce4..60dcfafdc11a 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> >
> > Thanks
> > Barry

Cheers, Lorenzo