Hi,

I have two MM topics to propose for LSF/MM/BPF 2021, both in the area of mmap lock performance:

I - Speculative page faults

The idea there is to avoid taking the mmap lock during page faults, at least for the easier cases. This requires the fault handler to be careful to avoid races with mmap writers (and most particularly munmap), and, when the new page is ready to be inserted into the user process, to verify at the last moment (after taking the page table lock) that there has been no race between the fault handler and any mmap writers (a rough sketch of this check is appended below). Such checks can be implemented locally, without hitting any global locks, which results in very nice scalability improvements when processing concurrent faults.

I think the idea is ready for prime time, and a patchset has been proposed, but it is not getting much traction yet. I suspect we will need to discuss the idea in person to figure out the next steps.

II - Fine grained MM locking

A major limitation of the current mmap lock design is that it covers a process's entire address space. In threaded applications, it is common for threads to issue concurrent requests for non-overlapping parts of the process address space - for example, one thread might be mmapping new memory while another releases a different range, and a third might fault within its own range. The current mmap lock design does not take the non-overlapping ranges into consideration, and consequently serialises the three requests above rather than letting them proceed in parallel.

A lot of work has been spent mitigating the problem by reducing mmap lock hold times (for example, dropping the mmap lock during page faults that hit disk, or downgrading to a read lock during longer mmap/munmap/populate operations). But this approach is hitting its limits, and I think it would be better to fix the core of the problem by making the mmap lock capable of allowing concurrent non-overlapping operations.

I would like to propose an approach that:
- separates the mmap lock into two separate locks: one that is only held for short periods of time to protect mm-wide data structures (including the vma tree), and another that functions as a range lock and can be held for longer periods of time (also sketched below);
- allows for incremental conversion from the current code to code that is aware of which ranges it locks.

I have been maintaining a prototype for this, which has been shared with a small set of people. The main holdup is page fault performance: in order to allow non-overlapping writers to proceed while some page faults are in progress, the prototype needs to maintain a shared structure holding the address of each pending page fault. Updating this shared structure gets very expensive in high-concurrency page fault benchmarks, though it seems quite unnoticeable in the macro benchmarks I have looked at.

Sorry for the lengthy proposal - I swear I've tried to keep it short :)

Thanks,

--
Michel "walken" Lespinasse
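
To make the validation step in topic I concrete, here is a minimal userspace sketch of the pattern: record a per-mm sequence count before doing the fault work without the mmap lock, then re-check it after taking the page table lock, and fall back to the locked path if a writer ran in between. All names here (fake_mm, mm_seq, spf_fault, ...) are made up for illustration, and this is only a model of the idea, not the posted patchset, which does considerably more work than this.

/*
 * Userspace model only: pthread mutexes stand in for the mmap lock and
 * the page table lock, and an atomic counter stands in for a per-mm
 * sequence count bumped by every mmap writer.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_mm {
	pthread_mutex_t mmap_lock;	/* stand-in for the mmap lock */
	pthread_mutex_t ptl;		/* stand-in for the page table lock */
	atomic_uint mm_seq;		/* bumped by every mmap writer */
	long pte;			/* stand-in for a page table entry */
};

static struct fake_mm fmm = {
	.mmap_lock = PTHREAD_MUTEX_INITIALIZER,
	.ptl = PTHREAD_MUTEX_INITIALIZER,
};

/* mmap writer: bumps the sequence count, then tears down under "ptl" */
static void fake_munmap(struct fake_mm *mm)
{
	pthread_mutex_lock(&mm->mmap_lock);
	atomic_fetch_add(&mm->mm_seq, 1);
	pthread_mutex_lock(&mm->ptl);
	mm->pte = 0;			/* pretend the mapping went away */
	pthread_mutex_unlock(&mm->ptl);
	pthread_mutex_unlock(&mm->mmap_lock);
}

/*
 * Speculative fault: no mmap lock taken.  Returns false if an mmap
 * writer raced with us; the caller would then retry the classic way,
 * with the mmap lock held.
 */
static bool spf_fault(struct fake_mm *mm, long new_pte)
{
	unsigned int seq = atomic_load(&mm->mm_seq);

	/* ... the expensive part: allocate the page, zero it, etc ... */

	pthread_mutex_lock(&mm->ptl);
	if (atomic_load(&mm->mm_seq) != seq) {
		pthread_mutex_unlock(&mm->ptl);
		return false;		/* a writer ran; discard our work */
	}
	mm->pte = new_pte;		/* no writer raced; install the page */
	pthread_mutex_unlock(&mm->ptl);
	return true;
}

int main(void)
{
	if (!spf_fault(&fmm, 42))
		puts("raced with a writer, fall back to the mmap lock");
	fake_munmap(&fmm);
	return 0;
}

The point is that both the sequence read and the re-check only touch mm-local state, never a global lock, which is where the scalability win described in topic I comes from.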
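
And to make the topic II split concrete, here is a minimal userspace sketch of one possible shape for the second lock: a short-held mutex standing in for the mm-wide lock, protecting a list of currently locked ranges, with lockers of overlapping ranges waiting on each other. The names and the list-based implementation are illustrative only and are not the prototype; a real implementation would want an interval tree and the pending-fault bookkeeping discussed above.

#include <pthread.h>
#include <stdbool.h>

struct mm_range {
	unsigned long start, end;	/* [start, end) being operated on */
	struct mm_range *next;
};

struct ranged_mm {
	pthread_mutex_t lock;		/* short-held, mm-wide */
	pthread_cond_t released;	/* signalled when a range is dropped */
	struct mm_range *held;		/* ranges currently locked */
};

static struct ranged_mm rmm = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.released = PTHREAD_COND_INITIALIZER,
};

static bool overlaps(struct mm_range *r, unsigned long start, unsigned long end)
{
	return r->start < end && start < r->end;
}

/* Wait until no holder overlaps [start, end), then record ourselves. */
static void mm_lock_range(struct ranged_mm *mm, struct mm_range *r,
			  unsigned long start, unsigned long end)
{
	pthread_mutex_lock(&mm->lock);
retry:
	for (struct mm_range *h = mm->held; h; h = h->next) {
		if (overlaps(h, start, end)) {
			pthread_cond_wait(&mm->released, &mm->lock);
			goto retry;
		}
	}
	r->start = start;
	r->end = end;
	r->next = mm->held;
	mm->held = r;
	pthread_mutex_unlock(&mm->lock);
}

static void mm_unlock_range(struct ranged_mm *mm, struct mm_range *r)
{
	pthread_mutex_lock(&mm->lock);
	for (struct mm_range **p = &mm->held; *p; p = &(*p)->next) {
		if (*p == r) {
			*p = r->next;
			break;
		}
	}
	pthread_cond_broadcast(&mm->released);
	pthread_mutex_unlock(&mm->lock);
}

int main(void)
{
	struct mm_range a, b;

	/* disjoint ranges: neither call blocks */
	mm_lock_range(&rmm, &a, 0x1000, 0x2000);
	mm_lock_range(&rmm, &b, 0x3000, 0x4000);
	mm_unlock_range(&rmm, &b);
	mm_unlock_range(&rmm, &a);
	return 0;
}

With this shape, the mmap/munmap/fault example above works as intended: three threads locking three disjoint ranges all proceed in parallel, while the mm-wide mutex is only held for the short list walk, never across the operation itself.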