Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations

* Barry Song <21cnbao@xxxxxxxxx> [250122 23:14]:
> > All userfaultfd operations, except write-protect, opportunistically use
> > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > critical section.
> > 
> > Write-protect operation requires mmap_lock as it iterates over multiple
> > vmas.
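
(For reference, the opportunistic locking described above boils down to the
pattern sketched below; this is a simplified fragment using the kernel's
per-VMA lock helpers, and the exact flow and error handling in the series
differ:)

    struct vm_area_struct *vma;

    /* Fast path: per-VMA read lock only; mmap_lock is never taken. */
    vma = lock_vma_under_rcu(mm, address);
    if (vma) {
            /* ... do the uffd operation under the vma lock ... */
            vma_end_read(vma);
    } else {
            /* Fall back: redo the lookup and the work under mmap_lock. */
            mmap_read_lock(mm);
            vma = vma_lookup(mm, address);
            /* ... do the uffd operation ... */
            mmap_read_unlock(mm);
    }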
> Hi Lokesh,
> 
> Apologies for reviving this old thread. We truly appreciate the excellent work
> you’ve done in transitioning many userfaultfd operations to per-VMA locks.
> 
> However, we’ve noticed that userfaultfd still remains one of the largest users
> of mmap_lock for write operations, with the other—binder—having been recently
> addressed by Carlos Llamas's "binder: faster page installations" series:
> 
> https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@xxxxxxxxxx/
> 
> The HeapTaskDaemon (Java GC) might frequently perform userfaultfd_register()
> and userfaultfd_unregister() operations, both of which require the mmap_lock
> in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> lower-priority background task, there are cases where, after acquiring the
> mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> threads waiting for the mmap_lock -- whether in writer or reader mode -- can
> end up experiencing significant delays (the delay can reach several hundred
> milliseconds in the worst case).
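
For context, each of those GC cycles boils down to a register/unregister pair
of ioctls on the userfaultfd file descriptor, and both take mmap_lock for
writing in the kernel because they may split or merge VMAs.  A minimal sketch
(the userfaultfd(2)/UFFDIO_API setup is omitted, and the MISSING mode is only
illustrative -- the actual GC may register with other modes):

    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    /* Register, and later unregister, one heap range on an existing uffd. */
    static int track_heap_range(int uffd, void *addr, unsigned long len)
    {
            struct uffdio_register reg = {
                    .range = { .start = (unsigned long)addr, .len = len },
                    .mode  = UFFDIO_REGISTER_MODE_MISSING,
            };
            struct uffdio_range unreg = {
                    .start = (unsigned long)addr, .len = len,
            };

            if (ioctl(uffd, UFFDIO_REGISTER, &reg))     /* mmap_lock, write */
                    return -1;
            /* ... GC pass runs; faults resolved via UFFDIO_COPY etc. ... */
            return ioctl(uffd, UFFDIO_UNREGISTER, &unreg); /* write again */
    }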

This needs an RFC or proposal or a discussion - certainly not a reply to
an old v7 patch set.  I'd want neon lights and stuff directing people to
this topic.

> 
> We haven’t yet identified an ideal solution for this. However, the Java heap
> appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> idea would be to designate a specific region of the user address space as
> "volatile" and restrict all "volatile" VMAs to this isolated region.

I'm going to assume the uffd changes are in the volatile area?  But
really, maybe you mean the opposite...  I'll just assume I guessed
correctly here.  Because both sides of this are competing for the write
lock.

> 
> We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will be
> mapped to the volatile space, while those without it will be mapped to the
> non-volatile space.
> 
>          ┌────────────┐TASK_SIZE             
>          │            │                      
>          │            │                      
>          │            │mmap VOLATILE         
>          ├────────────┤                      
>          │            │                      
>          │            │                      
>          │            │                      
>          │            │                      
>          │            │default mmap          
>          │            │                      
>          │            │                      
>          └────────────┘   
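
Under that scheme a heap implementation would opt in at mmap() time, along
the lines of the sketch below (MAP_VOLATILE is purely hypothetical -- no such
flag exists today):

    #include <sys/mman.h>

    /* Hypothetical flag; shown only to illustrate the proposed interface. */
    void *heap = mmap(NULL, heap_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_VOLATILE, -1, 0);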

No, this is way too complicated for what you are trying to work around.

You are proposing a segmented layout of the virtual memory area so that
an optional (userfaultfd) component can avoid a lock - which already has
another optional (vma locking) workaround.

I think we need to stand back and look at what we're doing here in
regards to userfaultfd and how it interacts with everything.  Things
have gotten complex and we're going in the wrong direction.

I suggest there is an easier way to avoid the contention, and maybe try
to rectify some of the uffd code to fit better with the evolved use
cases and vma locking.

> 
> VMAs in the volatile region are assigned their own volatile_mmap_lock,
> which is independent of the mmap_lock for the non-volatile region.
> Additionally, we ensure that no single VMA spans the boundary between
> the volatile and non-volatile regions. This separation prevents the
> frequent modifications of a small number of volatile VMAs from blocking
> other operations on a large number of non-volatile VMAs.
> 
> The implementation itself wouldn’t be overly complex, but the design
> might come across as somewhat hacky.
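
As a rough illustration of that split, lock selection would key off the
address range; everything below is made up for the sketch (neither the
boundary constant nor the volatile_mmap_lock field exists anywhere):

    /* Hypothetical only: illustrates picking a lock per region. */
    static struct rw_semaphore *region_mmap_lock(struct mm_struct *mm,
                                                 unsigned long addr)
    {
            if (addr >= VOLATILE_REGION_START)      /* hypothetical boundary */
                    return &mm->volatile_mmap_lock; /* hypothetical field */
            return &mm->mmap_lock;
    }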
> 
> Lastly, I have two questions:
> 
> 1. Have you observed similar issues where userfaultfd continues to
> cause lock contention and priority inversion?
> 
> 2. If so, do you have any ideas or suggestions on how to address this
> problem?

These are good questions.

I have a few of my own about what you described:

- What is causing your application to register/unregister so many uffds?

- Do the writes to the vmas overlap the register/unregister area
  today?  That is, do you have writes besides register/unregister going
  into your proposed volatile area or uffd modifications happening in
  the 'default mmap' area you specify above?

Barry, this is a good LSF topic - will you be there?  I hope to attend.

Something along the lines of "Userfualtfd contention, interactions, and
mitigations".

Thanks,
Liam





