> All userfaultfd operations, except write-protect, opportunistically use
> per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> critical section.
>
> Write-protect operation requires mmap_lock as it iterates over multiple
> vmas.

Hi Lokesh,

Apologies for reviving this old thread. We truly appreciate the excellent
work you've done in transitioning many userfaultfd operations to per-VMA
locks.

However, we've noticed that userfaultfd still remains one of the largest
users of mmap_lock for write operations, the other (binder) having been
addressed recently by Carlos Llamas's "binder: faster page installations"
series:

https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@xxxxxxxxxx/

The HeapTaskDaemon (Java GC) may frequently perform userfaultfd_register()
and userfaultfd_unregister() operations, both of which require the
mmap_lock in write mode to either split or merge VMAs. Since HeapTaskDaemon
is a lower-priority background task, there are cases where, after acquiring
the mmap_lock, it gets preempted by other tasks. As a result, even
high-priority threads waiting for the mmap_lock, whether in writer or
reader mode, can experience significant delays (several hundred
milliseconds in the worst case).

We haven't yet identified an ideal solution for this. However, the Java
heap appears to behave like a "volatile" VMA in its usage pattern. A
somewhat simplistic idea would be to designate a specific region of the
user address space as "volatile" and restrict all "volatile" VMAs to this
isolated region.

We could add a MAP_VOLATILE flag to mmap(). VMAs created with this flag
would be placed in the volatile space, while those without it would be
placed in the non-volatile space. (A rough userspace sketch of this idea
is appended after the sign-off below.)

┌────────────┐TASK_SIZE
│            │
│            │
│            │mmap VOLATILE
┼────────────┤
│            │
│            │
│            │
│            │
│            │default mmap
│            │
│            │
└────────────┘

VMAs in the volatile region are assigned their own volatile_mmap_lock,
which is independent of the mmap_lock for the non-volatile region. In
addition, we ensure that no single VMA spans the boundary between the
volatile and non-volatile regions. This separation prevents frequent
modifications of a small number of volatile VMAs from blocking other
operations on a large number of non-volatile VMAs.

The implementation itself wouldn't be overly complex, but the design might
come across as somewhat hacky.

Lastly, I have two questions:

1. Have you observed similar issues where userfaultfd continues to cause
   lock contention and priority inversion?

2. If so, do you have any ideas or suggestions on how to address this
   problem?

>
> Signed-off-by: Lokesh Gidra <lokeshgidra@xxxxxxxxxx>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
> ---
>  fs/userfaultfd.c              |  13 +-
>  include/linux/userfaultfd_k.h |   5 +-
>  mm/huge_memory.c              |   5 +-
>  mm/userfaultfd.c              | 380 ++++++++++++++++++++++++++--------
>  4 files changed, 299 insertions(+), 104 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index c00a021bcce4..60dcfafdc11a 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c

Thanks
Barry
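
P.S. For illustration only, here is a minimal userspace sketch of the idea
above. MAP_VOLATILE is hypothetical: it is only a proposal in this mail,
not part of the current UAPI, and the flag value below is an arbitrary
placeholder. The volatile_mmap_lock mentioned in the comments is likewise
just the proposed design, not existing kernel code.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Hypothetical flag: not in any UAPI header, value is a placeholder. */
#ifndef MAP_VOLATILE
#define MAP_VOLATILE 0x800000
#endif

int main(void)
{
	size_t heap_len = 64UL * 1024 * 1024;

	/*
	 * Java-heap-like region: frequently split/merged via
	 * userfaultfd_register()/userfaultfd_unregister(), so place it in
	 * the volatile part of the address space, which would be guarded
	 * by its own volatile_mmap_lock instead of the regular mmap_lock.
	 */
	void *heap = mmap(NULL, heap_len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_VOLATILE, -1, 0);
	if (heap == MAP_FAILED) {
		perror("mmap(MAP_VOLATILE)");
		return EXIT_FAILURE;
	}

	/*
	 * An ordinary mapping stays in the default region and keeps using
	 * the regular mmap_lock, so it is not blocked by heap VMA churn.
	 */
	void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		munmap(heap, heap_len);
		return EXIT_FAILURE;
	}

	munmap(buf, 4096);
	munmap(heap, heap_len);
	return 0;
}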