On Fri, Jan 24, 2025 at 2:45 AM Lokesh Gidra <lokeshgidra@xxxxxxxxxx> wrote:
>
> On Thu, Jan 23, 2025 at 8:52 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
> >
> > * Barry Song <21cnbao@xxxxxxxxx> [250122 23:14]:
> > > > All userfaultfd operations, except write-protect, opportunistically use
> > > > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > > > critical section.
> > > >
> > > > Write-protect operation requires mmap_lock as it iterates over multiple
> > > > vmas.
> > >
> > > Hi Lokesh,
> > >
> > > Apologies for reviving this old thread. We truly appreciate the excellent
> > > work you've done in transitioning many userfaultfd operations to per-VMA
> > > locks.
> > >
> > > However, we've noticed that userfaultfd still remains one of the largest
> > > users of mmap_lock for write operations, with the other (binder) having
> > > been recently addressed by Carlos Llamas's "binder: faster page
> > > installations" series:
> > >
> > > https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@xxxxxxxxxx/
> > >
> > > The HeapTaskDaemon (Java GC) might frequently perform userfaultfd_register()
> > > and userfaultfd_unregister() operations, both of which require the mmap_lock
> > > in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> > > lower-priority background task, there are cases where, after acquiring the
> > > mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> > > threads waiting for the mmap_lock, whether in writer or reader mode, can
> > > end up experiencing significant delays (the delay can reach several hundred
> > > milliseconds in the worst case).
>
> Do you happen to have some trace that I can take a look at?

We observed a rough trace in Android Studio showing the HeapTaskDaemon stuck
in a runnable state after holding the mmap_lock for 1 second, while other
threads were waiting for the lock. Our team will assist in collecting a
detailed trace, but everyone is currently on an extended Chinese New Year
holiday. Apologies, this may delay the process until after February 8.

> >
> > This needs an RFC or proposal or a discussion - certainly not a reply to
> > an old v7 patch set.  I'd want neon lights and stuff directing people to
> > this topic.
> >
> > >
> > > We haven't yet identified an ideal solution for this. However, the Java
> > > heap appears to behave like a "volatile" vma in its usage. A somewhat
> > > simplistic idea would be to designate a specific region of the user
> > > address space as "volatile" and restrict all "volatile" VMAs to this
> > > isolated region.
> >
> > I'm going to assume the uffd changes are in the volatile area?  But
> > really, maybe you mean the opposite.. I'll just assume I guessed
> > correct here.  Because, both sides of this are competing for the write
> > lock.
> >
> > >
> > > We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will
> > > be mapped to the volatile space, while those without it will be mapped
> > > to the non-volatile space.
> > >
> > >  ┌────────────┐TASK_SIZE
> > >  │            │
> > >  │            │
> > >  │            │mmap VOLATILE
> > >  ┼────────────┤
> > >  │            │
> > >  │            │
> > >  │            │
> > >  │            │
> > >  │            │default mmap
> > >  │            │
> > >  │            │
> > >  └────────────┘
> >
> > No, this is way too complicated for what you are trying to work around.
> >
> > You are proposing a segmented layout of the virtual memory area so that
> > an optional (userfaultfd) component can avoid a lock - which already has
> > another optional (vma locking) workaround.
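Just to make the intent concrete (not to argue for the approach): from
userspace, mappings that are frequently split/merged - like the Java
heap - would simply opt in at mmap() time, roughly as sketched below.
This is purely illustrative; MAP_VOLATILE does not exist in any kernel,
the flag value is a placeholder, and error handling is omitted.

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_VOLATILE
#define MAP_VOLATILE 0	/* hypothetical flag from the proposal above */
#endif

int main(void)
{
	size_t heap_size = 64UL << 20;

	/* Java heap: frequently split/merged by uffd register/unregister,
	 * so it would be placed in the "volatile" part of the address
	 * space and serialized by its own lock. */
	void *heap = mmap(NULL, heap_size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_VOLATILE, -1, 0);

	/* Everything else stays in the default space under mmap_lock. */
	void *other = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	printf("heap=%p other=%p\n", heap, other);
	return 0;
}

Mappings created with the flag would land in the upper "volatile" region
of the diagram above, so churn there would not block operations on the
default region.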
> >
> > I think we need to stand back and look at what we're doing here in
> > regards to userfaultfd and how it interacts with everything.  Things
> > have gotten complex and we're going in the wrong direction.
> >
> > I suggest there is an easier way to avoid the contention, and maybe try
> > to rectify some of the uffd code to fit better with the evolved use
> > cases and vma locking.
> >
> > >
> > > VMAs in the volatile region are assigned their own volatile_mmap_lock,
> > > which is independent of the mmap_lock for the non-volatile region.
> > > Additionally, we ensure that no single VMA spans the boundary between
> > > the volatile and non-volatile regions. This separation prevents the
> > > frequent modifications of a small number of volatile VMAs from blocking
> > > other operations on a large number of non-volatile VMAs.
> > >
> > > The implementation itself wouldn't be overly complex, but the design
> > > might come across as somewhat hacky.
>
> I agree with others. Your proposal sounds too radical and doesn't seem
> necessary to me. I'd like to see the traces and understand how
> real/frequent the issue is.

No worries, I figured the idea might not be well received, since it was more
of a hack. I was just trying to explain that some VMAs ("volatile" ones) may
contribute most of the mmap_lock contention, while others may not.

> >
> > >
> > > Lastly, I have two questions:
> > >
> > > 1. Have you observed similar issues where userfaultfd continues to
> > > cause lock contention and priority inversion?
>
> We haven't seen any such cases so far. But due to some other reasons,
> we are seriously considering temporarily increasing the GC thread's
> priority when it is running a stop-the-world pause.
>
> > >
> > > 2. If so, do you have any ideas or suggestions on how to address this
> > > problem?
>
> There are userspace solutions possible to reduce/eliminate the number
> of times userfaultfd register/unregister are done during a GC. I
> didn't do it due to the added complexity it would introduce to the GC's
> code.
>
> > These are good questions.
> >
> > I have a few of my own about what you described:
> >
> > - What is causing your application to register/unregister so many uffds?
>
> In every GC invocation, we have two userfaultfd_register() + mremap()
> in a stop-the-world pause, and then two userfaultfd_unregister() at
> the end of GC. The problematic ones ought to be the ones in the pause,
> as we want to keep it as short as possible. The reason we want to
> register/unregister the heap during GC is so that the overhead of
> userfaults can be avoided when GC is not active.
>
> >
> > - Do the writes to the vmas overlap the register/unregister area
> >   today?  That is, do you have writes besides register/unregister going
> >   into your proposed volatile area or uffd modifications happening in
> >   the 'default mmap' area you specify above?
>
> That shouldn't be the case. The access to uffd-registered VMAs should
> start *after* registration. That's the reason it is done in a pause.
> AFAIK, the source of contention is if some native (non-Java) thread,
> which is not participating in the pause, does a mmap_lock write
> operation (mmap/munmap/mprotect/mremap/mlock etc.) elsewhere in the
> address space. The heap can't be involved.

Exactly. Essentially, we observe that the GC holds the mmap_lock but gets
preempted for an extended period, causing other tasks performing mmap-like
operations to wait for the GC to release the lock.
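To make that concrete, my (simplified) understanding of the per-GC-cycle
sequence Lokesh describes above is roughly the sketch below. It is only
illustrative - the names, ordering and mremap() flags are placeholders
rather than ART's actual code, and error handling is omitted - but it
shows the operations that take mmap_lock in write mode: UFFDIO_REGISTER
may split the heap VMA, mremap() modifies the VMA tree, and
UFFDIO_UNREGISTER may merge VMAs back.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4	/* older libc headers may lack this */
#endif

/* one-time setup: create the uffd and negotiate the API */
static int uffd_open(void)
{
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd >= 0)
		ioctl(uffd, UFFDIO_API, &api);	/* error handling omitted */
	return uffd;
}

/* stop-the-world pause: register the heap range and move its pages aside;
 * UFFDIO_REGISTER may split VMAs and mremap() changes the VMA tree, and
 * both take mmap_lock in write mode */
static void *gc_pause(int uffd, void *heap, size_t len)
{
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)heap, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};

	ioctl(uffd, UFFDIO_REGISTER, &reg);
	return mremap(heap, len, len, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
}

/* end of GC: unregister the range, which may merge the split VMAs back,
 * again with mmap_lock held in write mode */
static void gc_finish(int uffd, void *heap, size_t len)
{
	struct uffdio_range range = {
		.start = (unsigned long)heap,
		.len   = len,
	};

	ioctl(uffd, UFFDIO_UNREGISTER, &range);
}

If the GC thread is preempted while holding the lock in any of these
writer sections, every other mmap_lock user in the process queues up
behind it, which is exactly the delay we are observing.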
> >
> > Barry, this is a good LSF topic - will you be there?  I hope to attend.
> >
> > Something along the lines of "Userfaultfd contention, interactions, and
> > mitigations".

Thank you for your interest in this topic. It's unlikely that a travel budget
will be available, so I won't be attending in person. I might apply for
virtual attendance to participate in some discussions, but I don't plan to
run a session remotely; too many things can go wrong.

> >
> > Thanks,
> > Liam
>

Thanks
Barry