On Mon, Feb 08, 2021 at 01:26:43PM +0000, Matthew Wilcox wrote:
> Next problem: /proc/$pid/smaps calls walk_page_vma() which starts out by
> saying:
>         mmap_assert_locked(walk.mm);
> which made me realise that smaps is also going to walk the page tables.
> So the page tables have to be pinned by the existence of the VMA.
> Which means the page tables must be freed by the same RCU callback that
> frees the VMA.  But doing that means that a task which calls mmap();
> munmap(); mmap(); must avoid allocating the same address for the second
> mmap (until the RCU grace period has elapsed), otherwise threads on
> other CPUs may see the stale PTEs instead of the new ones.
>
> Solution 1: Move the page table freeing into the RCU callback, call
> synchronize_rcu() in munmap().
>
> Solution 2: Refcount the VMA and free the page tables on refcount
> dropping to zero.  This doesn't actually work because the stale PTE
> problem still exists.
>
> Solution 3: When unmapping a VMA, instead of erasing the VMA from the
> maple tree, put a "dead" entry in its place.  Once the RCU freeing and the
> TLB shootdown has happened, erase the entry and it can then be allocated.
> If we do that MAP_FIXED will have to synchronize_rcu() if it overlaps
> a dead entry.

Solution 4: RCU free the page table pages and teach pagewalk.c to be
RCU-safe.  That means that it will have to use rcu_dereference() or
READ_ONCE to dereference (eg) pmdp, but also allows GUP-fast to run
under the rcu read lock instead of disabling interrupts.
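
To make the "read it once" part of Solution 4 concrete, here is a rough
userspace model of what a lockless walker would have to do.  This is not
kernel code and not a proposed patch: pte_table_t, pmd_slot_t,
pmd_snapshot(), walk_one() and unmap_table() are all made-up names, the
table is tiny, and a C11 acquire load stands in for
READ_ONCE()/rcu_dereference().  The only point it tries to show is that
the walker loads the upper-level entry exactly once and then works from
that snapshot, since a concurrent munmap() may clear the entry at any time.

```c
/* Userspace model only.  Every type and function here is hypothetical. */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_TABLE 4	/* tiny table, for illustration only */

typedef struct pte_table {
	uint64_t pte[PTRS_PER_TABLE];
} pte_table_t;

/* A "PMD" slot: NULL means nothing mapped, else points at a PTE table. */
typedef struct pmd_slot {
	_Atomic(pte_table_t *) table;
} pmd_slot_t;

/* Stand-in for READ_ONCE()/rcu_dereference(): one load of the slot. */
static pte_table_t *pmd_snapshot(pmd_slot_t *pmdp)
{
	return atomic_load_explicit(&pmdp->table, memory_order_acquire);
}

/*
 * Lockless lookup.  Returns 0 and stores the entry in *out, or -1 if
 * nothing is mapped.  In the kernel the body would run between
 * rcu_read_lock() and rcu_read_unlock().
 */
static int walk_one(pmd_slot_t *pmdp, size_t idx, uint64_t *out)
{
	pte_table_t *table;

	assert(idx < PTRS_PER_TABLE);

	table = pmd_snapshot(pmdp);	/* load exactly once */
	if (table == NULL)
		return -1;

	/* Use the snapshot; re-reading pmdp->table here could see NULL. */
	*out = table->pte[idx];
	return 0;
}

/*
 * Unmap side: unpublish the table and hand it back to the caller.  In
 * the kernel the table would then be freed only after a grace period
 * (e.g. via call_rcu()), so walkers holding a snapshot stay safe.
 */
static pte_table_t *unmap_table(pmd_slot_t *pmdp)
{
	return atomic_exchange_explicit(&pmdp->table, (pte_table_t *)NULL,
					memory_order_acq_rel);
}
```

A walker that had already taken its snapshot before unmap_table() ran
would keep reading the old table, which is why the free has to be
deferred past the grace period rather than done immediately.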