Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations

On Thu, Jan 23, 2025 at 8:52 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
>
> * Barry Song <21cnbao@xxxxxxxxx> [250122 23:14]:
> > > All userfaultfd operations, except write-protect, opportunistically use
> > > per-vma locks to lock vmas. On failure, the operation is retried inside
> > > an mmap_lock critical section.
> > >
> > > The write-protect operation requires mmap_lock, as it iterates over
> > > multiple vmas.
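
(For context, the locking pattern those operations use is roughly the
following. This is a simplified sketch of the helper the series adds in
mm/userfaultfd.c; the name and details are approximate, and error and
anon_vma handling are omitted:

    static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
                                                unsigned long address)
    {
        struct vm_area_struct *vma;

        /* Fast path: take only the per-vma lock; mmap_lock is untouched. */
        vma = lock_vma_under_rcu(mm, address);
        if (vma)
            return vma;

        /* Slow path: retry inside an mmap_lock read critical section. */
        mmap_read_lock(mm);
        vma = vma_lookup(mm, address);
        if (vma)
            /* Acquire the vma lock while mmap_lock pins the vma. */
            vma_start_read_locked(vma);
        mmap_read_unlock(mm);
        return vma;
    }

Either way the vma is returned with its per-vma lock held, so the
mmap_lock read section is only a brief window around the lookup.)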
> > Hi Lokesh,
> >
> > Apologies for reviving this old thread. We truly appreciate the excellent work
> > you’ve done in transitioning many userfaultfd operations to per-VMA locks.
> >
> > However, we've noticed that userfaultfd remains one of the largest users
> > of mmap_lock for write operations; the other major one, binder, was
> > recently addressed by Carlos Llamas's "binder: faster page installations"
> > series:
> >
> > https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@xxxxxxxxxx/
> >
> > The HeapTaskDaemon (Java GC) may frequently perform userfaultfd_register()
> > and userfaultfd_unregister() operations, both of which require the mmap_lock
> > in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> > lower-priority background task, there are cases where, after acquiring the
> > mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> > threads waiting for the mmap_lock, whether in writer or reader mode, can
> > end up experiencing significant delays (several hundred milliseconds in the
> > worst case).

Do you happen to have some trace that I can take a look at?
>
> This needs an RFC or proposal or a discussion - certainly not a reply to
> an old v7 patch set.  I'd want neon lights and stuff directing people to
> this topic.
>
> >
> > We haven’t yet identified an ideal solution for this. However, the Java heap
> > appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> > idea would be to designate a specific region of the user address space as
> > "volatile" and restrict all "volatile" VMAs to this isolated region.
>
> I'm going to assume the uffd changes are in the volatile area?  But
> really, maybe you mean the opposite...  I'll just assume I guessed
> correctly here.  After all, both sides of this are competing for the
> write lock.
>
> >
> > We could add a MAP_VOLATILE flag to mmap. VMA regions mapped with this
> > flag would be placed in the volatile space, while those without it would
> > be mapped in the non-volatile space.
> >
> >          ┌────────────┐TASK_SIZE
> >          │            │
> >          │            │
> >          │            │mmap VOLATILE
> >          ├────────────┤
> >          │            │
> >          │            │
> >          │            │
> >          │            │
> >          │            │default mmap
> >          │            │
> >          │            │
> >          └────────────┘
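
(If I understand the proposal, userspace would opt in per mapping, e.g.
something like the sketch below, where MAP_VOLATILE is the hypothetical
flag described above and the flag value is made up for illustration:

    #include <sys/mman.h>

    /* Hypothetical -- exists only in this proposal, not in any kernel. */
    #define MAP_VOLATILE 0x800000

    void *heap = mmap(NULL, heap_len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_VOLATILE, -1, 0);

Mappings created with the flag would land in the carved-out region and
be serialized by its own lock rather than the process-wide mmap_lock.)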
>
> No, this is way too complicated for what you are trying to work around.
>
> You are proposing a segmented layout of the virtual memory area so that
> an optional (userfaultfd) component can avoid a lock - which already has
> another optional (vma locking) workaround.
>
> I think we need to stand back and look at what we're doing here in
> regards to userfaultfd and how it interacts with everything.  Things
> have gotten complex and we're going in the wrong direction.
>
> I suggest there is an easier way to avoid the contention, and maybe try
> to rectify some of the uffd code to fit better with the evolved use
> cases and vma locking.
>
> >
> > VMAs in the volatile region are assigned their own volatile_mmap_lock,
> > which is independent of the mmap_lock for the non-volatile region.
> > Additionally, we ensure that no single VMA spans the boundary between
> > the volatile and non-volatile regions. This separation prevents the
> > frequent modifications of a small number of volatile VMAs from blocking
> > other operations on a large number of non-volatile VMAs.
> >
> > The implementation itself wouldn’t be overly complex, but the design
> > might come across as somewhat hacky.

I agree with the others. Your proposal sounds too radical and doesn't
seem necessary to me. I'd like to see the traces and understand how
real/frequent the issue is.
> >
> > Lastly, I have two questions:
> >
> > 1. Have you observed similar issues where userfaultfd continues to
> > cause lock contention and priority inversion?

We haven't seen any such cases so far. But for some other reasons, we
are seriously considering temporarily increasing the GC thread's
priority while it is running a stop-the-world pause.
> >
> > 2. If so, do you have any ideas or suggestions on how to address this
> > problem?

There are userspace solutions possible that would reduce or eliminate
the number of userfaultfd register/unregister calls done during a GC. I
didn't pursue them because of the added complexity they would introduce
to the GC's code.
>
> These are good questions.
>
> I have a few of my own about what you described:
>
> - What is causing your application to register/unregister so many uffds?

In every GC invocation, we have two userfaultfd_register() calls plus an
mremap() in a stop-the-world pause, and then two userfaultfd_unregister()
calls at the end of GC. The problematic ones ought to be the ones in the
pause, as we want to keep the pause as short as possible. The reason we
register/unregister the heap during GC is so that the overhead of
userfaults can be avoided when GC is not active.
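
To make that concrete, the flow per GC cycle looks something like the
sketch below (simplified; the uffd fd setup via userfaultfd()/UFFDIO_API
is omitted, the registration mode is illustrative, and the real code
lives in ART's GC):

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <stdint.h>

    /* Inside the stop-the-world pause; takes mmap_lock for write
     * (registration may split VMAs).  Called for two ranges, and the
     * pause also does the mremap() of the heap (another writer). */
    static void pause_register(int uffd, uintptr_t start, size_t len)
    {
        struct uffdio_register reg = {
            .range = { .start = start, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);
    }

    /* At the end of GC, outside the pause; takes mmap_lock for write
     * (unregistration may merge VMAs).  Also called for two ranges. */
    static void gc_end_unregister(int uffd, uintptr_t start, size_t len)
    {
        struct uffdio_range range = { .start = start, .len = len };
        ioctl(uffd, UFFDIO_UNREGISTER, &range);
    }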

>
> - Do the writes to the vmas overlap the register/unregister area
>   today?  That is, do you have writes besides register/unregister going
>   into your proposed volatile area, or uffd modifications happening in
>   the 'default mmap' area you specify above?

That shouldn't be the case. Accesses to uffd-registered VMAs should
start *after* registration; that's the reason registration is done in a
pause. AFAIK, the source of contention is some native (non-Java) thread,
not participating in the pause, performing an mmap_lock write operation
(mmap/munmap/mprotect/mremap/mlock etc.) elsewhere in the address space.
The heap itself can't be involved.
>
> Barry, this is a good LSF topic - will you be there?  I hope to attend.
>
> Something along the lines of "Userfaultfd contention, interactions, and
> mitigations".
>
> Thanks,
> Liam
>