Hi,

I have two MM topics to propose for LSF/MM/BPF 2021, both in the area of mmap lock performance:

I - Speculative page faults

The idea there is to avoid taking the mmap lock during page faults, at least for the easier cases. This requires the fault handler to be careful to avoid races with mmap writers (and most particularly munmap), and, when the new page is ready to be inserted into the user process, to verify at the last moment (after taking the page table lock) that there has been no race between the fault handler and any mmap writers (a rough sketch of this check is appended below). Such checks can be implemented locally, without hitting any global locks, which results in very nice scalability improvements when processing concurrent faults.

I think the idea is ready for prime time, and a patchset has been proposed, but it is not getting much traction yet. I suspect we will need to discuss the idea in person to figure out the next steps.

II - Fine grained MM locking

A major limitation of the current mmap lock design is that it covers a process's entire address space. In threaded applications, it is common for threads to issue concurrent requests for non-overlapping parts of the process address space - for example, one thread might be mmapping new memory while another releases a different range, and a third might fault within its own range. The current mmap lock design does not take the non-overlapping ranges into consideration, and consequently serialises the three requests above rather than letting them proceed in parallel.

A lot of work has been spent mitigating the problem by reducing mmap lock hold times (for example, dropping the mmap lock during page faults that hit disk, or downgrading to a read lock during longer mmap/munmap/populate operations). But this approach is hitting its limits, and I think it would be better to fix the core of the problem by making the mmap lock capable of allowing concurrent non-overlapping operations.

I would like to propose an approach that:
- separates the mmap lock into two separate locks: one that is only held for short periods of time to protect mm-wide data structures (including the vma tree), and another that functions as a range lock and can be held for longer periods of time (also sketched below);
- allows for incremental conversion from the current code to code that is aware of which ranges it locks.

I have been maintaining a prototype for this, which has been shared with a small set of people. The main holdup is page fault performance: in order to allow non-overlapping writers to proceed while some page faults are in progress, the prototype needs to maintain a shared structure holding the address of each pending page fault. Updating this shared structure gets very expensive in high-concurrency page fault benchmarks, though it seems quite unnoticeable in the macro benchmarks I have looked at.

Sorry for the lengthy proposal - I swear I've tried to keep it short :)

Thanks,

--
Michel "walken" Lespinasse
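
To make the validation step in topic I concrete, here is a minimal userspace sketch of the pattern: record a per-mm sequence count before doing the fault work without the mmap lock, then re-check it after taking the page table lock, and fall back to the locked path if a writer ran in between. All names here (fake_mm, mm_seq, spf_fault, ...) are made up for illustration, and this is only a model of the idea, not the posted patchset, which does considerably more work than this.

/*
 * Userspace model only: pthread mutexes stand in for the mmap lock and
 * the page table lock, and an atomic counter stands in for a per-mm
 * sequence count bumped by every mmap writer.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_mm {
	pthread_mutex_t mmap_lock;	/* stand-in for the mmap lock */
	pthread_mutex_t ptl;		/* stand-in for the page table lock */
	atomic_uint mm_seq;		/* bumped by every mmap writer */
	long pte;			/* stand-in for a page table entry */
};

static struct fake_mm fmm = {
	.mmap_lock = PTHREAD_MUTEX_INITIALIZER,
	.ptl = PTHREAD_MUTEX_INITIALIZER,
};

/* mmap writer: bumps the sequence count, then tears down under "ptl" */
static void fake_munmap(struct fake_mm *mm)
{
	pthread_mutex_lock(&mm->mmap_lock);
	atomic_fetch_add(&mm->mm_seq, 1);
	pthread_mutex_lock(&mm->ptl);
	mm->pte = 0;			/* pretend the mapping went away */
	pthread_mutex_unlock(&mm->ptl);
	pthread_mutex_unlock(&mm->mmap_lock);
}

/*
 * Speculative fault: no mmap lock taken.  Returns false if an mmap
 * writer raced with us; the caller would then retry the classic way,
 * with the mmap lock held.
 */
static bool spf_fault(struct fake_mm *mm, long new_pte)
{
	unsigned int seq = atomic_load(&mm->mm_seq);

	/* ... the expensive part: allocate the page, zero it, etc ... */

	pthread_mutex_lock(&mm->ptl);
	if (atomic_load(&mm->mm_seq) != seq) {
		pthread_mutex_unlock(&mm->ptl);
		return false;		/* a writer ran; discard our work */
	}
	mm->pte = new_pte;		/* no writer raced; install the page */
	pthread_mutex_unlock(&mm->ptl);
	return true;
}

int main(void)
{
	if (!spf_fault(&fmm, 42))
		puts("raced with a writer, fall back to the mmap lock");
	fake_munmap(&fmm);
	return 0;
}

The point is that both the sequence read and the re-check only touch mm-local state, never a global lock, which is where the scalability win described in topic I comes from.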
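
And to make the topic II split concrete, here is a minimal userspace sketch of one possible shape for the second lock: a short-held mutex standing in for the mm-wide lock, protecting a list of currently locked ranges, with lockers of overlapping ranges waiting on each other. The names and the list-based implementation are illustrative only and are not the prototype; a real implementation would want an interval tree and the pending-fault bookkeeping discussed above.

#include <pthread.h>
#include <stdbool.h>

struct mm_range {
	unsigned long start, end;	/* [start, end) being operated on */
	struct mm_range *next;
};

struct ranged_mm {
	pthread_mutex_t lock;		/* short-held, mm-wide */
	pthread_cond_t released;	/* signalled when a range is dropped */
	struct mm_range *held;		/* ranges currently locked */
};

static struct ranged_mm rmm = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.released = PTHREAD_COND_INITIALIZER,
};

static bool overlaps(struct mm_range *r, unsigned long start, unsigned long end)
{
	return r->start < end && start < r->end;
}

/* Wait until no holder overlaps [start, end), then record ourselves. */
static void mm_lock_range(struct ranged_mm *mm, struct mm_range *r,
			  unsigned long start, unsigned long end)
{
	pthread_mutex_lock(&mm->lock);
retry:
	for (struct mm_range *h = mm->held; h; h = h->next) {
		if (overlaps(h, start, end)) {
			pthread_cond_wait(&mm->released, &mm->lock);
			goto retry;
		}
	}
	r->start = start;
	r->end = end;
	r->next = mm->held;
	mm->held = r;
	pthread_mutex_unlock(&mm->lock);
}

static void mm_unlock_range(struct ranged_mm *mm, struct mm_range *r)
{
	pthread_mutex_lock(&mm->lock);
	for (struct mm_range **p = &mm->held; *p; p = &(*p)->next) {
		if (*p == r) {
			*p = r->next;
			break;
		}
	}
	pthread_cond_broadcast(&mm->released);
	pthread_mutex_unlock(&mm->lock);
}

int main(void)
{
	struct mm_range a, b;

	/* disjoint ranges: neither call blocks */
	mm_lock_range(&rmm, &a, 0x1000, 0x2000);
	mm_lock_range(&rmm, &b, 0x3000, 0x4000);
	mm_unlock_range(&rmm, &b);
	mm_unlock_range(&rmm, &a);
	return 0;
}

With this shape, the mmap/munmap/fault example above works as intended: three threads locking three disjoint ranges all proceed in parallel, while the mm-wide mutex is only held for the short list walk, never across the operation itself.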