On Thu, Jan 09, 2020 at 06:32:06PM +0100, SeongJae Park wrote:
> On Thu, 9 Jan 2020 18:07:15 +0100 Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> > On Thu 09-01-20 18:03:25, Michal Hocko wrote:
> > > I might misremember but RCU based VMA handling has
> > > been considered in the past. I do not remember details but there were
> > > some problems and page tables allocation is not the biggest one.
> >
> > I have found https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf in my
> > notes. I managed to forget everything but maybe it will be useful for a
> > reference.
>
> The subsequent work from the authors
> (https://people.csail.mit.edu/nickolai/papers/clements-radixvm-2014-08-05.pdf)
> might also be useful for understanding the limitations found in that
> work.

Thanks for both of those references.

> I also forgot many details, but as far as I remember, the biggest problem
> with the rcuvm was the update-side scalability limitation that results from
> the single updater lock and the TLB invalidations.  I had also internally
> implemented another RCU-based vm utilizing fine-grained update-side
> synchronization.  The write-side performance of my version was therefore
> much improved, but it also dropped off with heavily write-intensive
> workloads due to the TLB flush overhead.
>
> Page table allocations didn't bother me at that time.

As far as I can tell, both of these implementations work by using RCU to
look up a VMA, taking a reference count on the VMA, then dropping the RCU
read lock before walking the page tables.  Sleeping to allocate page
tables is then fine, as the reference count prevents the VMA from going
away.

One of the use cases we're concerned about involves a high percentage of
page faults on a single large (terabytes) VMA in a highly multithreaded
process.  Moving the contention from a rwsem in the mm_struct to a
refcount in the VMA will not help performance substantially for this user.
The proposal consists of three phases.  In phase 1, we convert the rbtree
to the maple tree and leave the locking alone.  In phase 2, we change the
locking to a per-VMA refcount, looked up under RCU.  The page table
allocation problem arises in phase 3, where we attempt to handle page
faults entirely under the RCU read lock.  If we encounter problems, we can
fall back to acquiring the VMA refcount, but we need the page allocation
to fail rather than sleep (or magically drop the RCU lock and return an
indication that it has done so, but that doesn't seem to be an approach
that would find any favour).