Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations

On Fri, Jan 24, 2025 at 2:45 AM Lokesh Gidra <lokeshgidra@xxxxxxxxxx> wrote:
>
> On Thu, Jan 23, 2025 at 8:52 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
> >
> > * Barry Song <21cnbao@xxxxxxxxx> [250122 23:14]:
> > > > All userfaultfd operations, except write-protect, opportunistically use
> > > > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > > > critical section.
> > > >
> > > > Write-protect operation requires mmap_lock as it iterates over multiple
> > > > vmas.
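(For context for anyone joining the thread here: the opportunistic pattern
described above boils down to roughly the sketch below. This is not the code
from the patch; lock_vma_under_rcu(), vma_end_read(), vma_lookup() and the
mmap_lock helpers are real kernel interfaces, while the uffd helper is a
made-up placeholder.)

/*
 * Simplified sketch only: try the per-VMA lock first and fall back to
 * the mmap_lock read lock when the lockless lookup fails.
 */
static long uffd_do_op(struct mm_struct *mm, unsigned long dst_addr)
{
	struct vm_area_struct *vma;
	long err;

	/* Fast path: find and read-lock the VMA without taking mmap_lock. */
	vma = lock_vma_under_rcu(mm, dst_addr);
	if (vma) {
		err = uffd_op_on_vma(vma, dst_addr);	/* placeholder helper */
		vma_end_read(vma);
		return err;
	}

	/* Slow path: retry inside the mmap_lock read critical section. */
	mmap_read_lock(mm);
	vma = vma_lookup(mm, dst_addr);
	err = vma ? uffd_op_on_vma(vma, dst_addr) : -ENOENT;
	mmap_read_unlock(mm);

	return err;
}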
> > >
> > > Hi Lokesh,
> > >
> > > Apologies for reviving this old thread. We truly appreciate the excellent work
> > > you’ve done in transitioning many userfaultfd operations to per-VMA locks.
> > >
> > > However, we’ve noticed that userfaultfd still remains one of the largest users
> > > of mmap_lock for write operations, with the other—binder—having been recently
> > > addressed by Carlos Llamas's "binder: faster page installations" series:
> > >
> > > https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@xxxxxxxxxx/
> > >
> > > The HeapTaskDaemon (Java GC) might frequently perform userfaultfd_register()
> > > and userfaultfd_unregister() operations, both of which require the mmap_lock
> > > in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> > > lower-priority background task, there are cases where, after acquiring the
> > > mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> > > threads waiting for the mmap_lock — whether in writer or reader mode — can
> > > end up experiencing significant delays (the delay can reach several hundred
> > > milliseconds in the worst case).
>
> Do you happen to have some trace that I can take a look at?

We observed a rough trace in Android Studio showing the HeapTaskDaemon
stuck in the runnable state for roughly one second while holding the mmap_lock,
with other threads blocked waiting for the lock.

Our team will assist in collecting a detailed trace, but everyone is currently
on an extended Chinese New Year holiday. Apologies in advance; this may delay
things until after February 8.

> >
> > This needs an RFC or proposal or a discussion - certainly not a reply to
> > an old v7 patch set.  I'd want neon lights and stuff directing people to
> > this topic.
> >
> > >
> > > We haven’t yet identified an ideal solution for this. However, the Java heap
> > > appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> > > idea would be to designate a specific region of the user address space as
> > > "volatile" and restrict all "volatile" VMAs to this isolated region.
> >
> > I'm going to assume the uffd changes are in the volatile area?  But
> > really, maybe you mean the opposite...  I'll just assume I guessed
> > correctly here.  Because both sides of this are competing for the write
> > lock.
> >
> > >
> > > We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will be
> > > mapped to the volatile space, while those without it will be mapped to the
> > > non-volatile space.
> > >
> > >          ┌────────────┐TASK_SIZE
> > >          │            │
> > >          │            │
> > >          │            │mmap VOLATILE
> > >          ├────────────┤
> > >          │            │
> > >          │            │
> > >          │            │
> > >          │            │
> > >          │            │default mmap
> > >          │            │
> > >          │            │
> > >          └────────────┘
> >
> > No, this is way too complicated for what you are trying to work around.
> >
> > You are proposing a segmented layout of the virtual memory area so that
> > an optional (userfaultfd) component can avoid a lock - which already has
> > another optional (vma locking) workaround.
> >
> > I think we need to stand back and look at what we're doing here in
> > regards to userfaultfd and how it interacts with everything.  Things
> > have gotten complex and we're going in the wrong direction.
> >
> > I suggest there is an easier way to avoid the contention, and maybe try
> > to rectify some of the uffd code to fit better with the evolved use
> > cases and vma locking.
> >
> > >
> > > VMAs in the volatile region are assigned their own volatile_mmap_lock,
> > > which is independent of the mmap_lock for the non-volatile region.
> > > Additionally, we ensure that no single VMA spans the boundary between
> > > the volatile and non-volatile regions. This separation prevents the
> > > frequent modifications of a small number of volatile VMAs from blocking
> > > other operations on a large number of non-volatile VMAs.
> > >
> > > The implementation itself wouldn’t be overly complex, but the design
> > > might come across as somewhat hacky.
>
> I agree with others. Your proposal sounds too radical and doesn't seem
> necessary to me. I'd like to see the traces and understand how
> real/frequent the issue is.

No worries, I figured the idea might not be well-received since it was more of
a hack. I was just trying to explain that some VMAs (the "volatile" ones) may
contribute far more mmap_lock contention than others.
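To make that concrete, the userspace side of the idea would have looked
roughly like the sketch below. MAP_VOLATILE is purely hypothetical (it does
not exist), and the flag value is made up for illustration:

#include <sys/mman.h>

/* Hypothetical flag from the proposal above; value made up for illustration. */
#define MAP_VOLATILE	0x800000

int main(void)
{
	size_t heap_size = 512UL << 20;	/* example size */

	/*
	 * Java heap: frequently uffd-registered/unregistered during GC, so
	 * under the proposal it would go into the "volatile" region guarded
	 * by its own volatile_mmap_lock.
	 */
	void *heap = mmap(NULL, heap_size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_VOLATILE, -1, 0);

	/* Everything else keeps using the default region and mmap_lock. */
	void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return (heap == MAP_FAILED || buf == MAP_FAILED) ? 1 : 0;
}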

> > >
> > > Lastly, I have two questions:
> > >
> > > 1. Have you observed similar issues where userfaultfd continues to
> > > cause lock contention and priority inversion?
>
> We haven't seen any such cases so far. But for some other reasons, we
> are seriously considering temporarily increasing the GC-thread's
> priority when it is running a stop-the-world pause.
> > >
> > > 2. If so, do you have any ideas or suggestions on how to address this
> > > problem?
>
> There are possible userspace solutions to reduce/eliminate the number
> of times userfaultfd register/unregister are done during a GC. I
> didn't do it due to the added complexity it would introduce to the GC's
> code.
> >
> > These are good questions.
> >
> > I have a few of my own about what you described:
> >
> > - What is causing your application to register/unregister so many uffds?
>
> In every GC invocation, we have two userfaultfd_register() + mremap()
> in a stop-the-world pause, and then two userfaultfd_unregister() at
> the end of GC. The problematic ones ought to be the one in the pause
> as we want to keep it as short as possible. The reason we want to
> register/unregister the heap during GC is so that the overhead of
> userfaults can be avoided when GC is not active.
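For reference, each of those register/unregister calls is just an ioctl on the
uffd; splitting or merging the covering VMA is where the mmap_lock write
acquisition comes from. A rough sketch follows (the mode is an assumption;
ART's actual usage may differ):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/*
 * Rough sketch of one register/unregister pair on the heap range; ART's
 * actual modes and ranges may differ.
 */
static int uffd_register_range(int uffd, void *start, unsigned long len)
{
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)start, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,	/* assumed mode */
	};

	/* Splitting the covering VMA here is what takes mmap_lock for write. */
	return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

static int uffd_unregister_range(int uffd, void *start, unsigned long len)
{
	struct uffdio_range range = {
		.start = (unsigned long)start,
		.len   = len,
	};

	/* Merging the VMAs back also takes mmap_lock for write. */
	return ioctl(uffd, UFFDIO_UNREGISTER, &range);
}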
>
> >
> > - Do the writes to the vmas overlap the register/unregister area
> >   today?  That is, do you have writes besides register/unregister going
> >   into your proposed volatile area or uffd modifications happening in
> >   the 'default mmap' area you specify above?
>
> That shouldn't be the case. The access to uffd registered VMAs should
> start *after* registration. That's the reason it is done in a pause.
> AFAIK, the source of contention is if some native (non-java) thread,
> which is not participating in the pause, does a mmap_lock write
> operation (mmap/munmap/mprotect/mremap/mlock etc.) elsewhere in the
> address space. The heap can't be involved.

Exactly. Essentially, we observe that the GC holds the mmap_lock but
gets preempted for an extended period, causing other tasks performing
mmap-like operations to wait for the GC to release the lock.

> >
> > Barry, this is a good LSF topic - will you be there?  I hope to attend.
> >
> > Something along the lines of "Userfaultfd contention, interactions, and
> > mitigations".

Thank you for your interest in this topic.

It's unlikely that a travel budget will be available, so I won’t be attending
in person. I might apply for virtual attendance to participate in some
discussions, but I don’t plan to run a session remotely—too many things
can go wrong.

> >
> > Thanks,
> > Liam
> >

Thanks
Barry




