On Tue, Feb 09, 2021 at 01:38:22PM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 09, 2021 at 06:19:35PM +0100, Laurent Dufour wrote:
> > On 09/02/2021 at 15:29, Matthew Wilcox wrote:
> > > On Mon, Feb 08, 2021 at 01:26:43PM +0000, Matthew Wilcox wrote:
> > > > Next problem: /proc/$pid/smaps calls walk_page_vma() which starts out by
> > > > saying:
> > > > 	mmap_assert_locked(walk.mm);
> > > > which made me realise that smaps is also going to walk the page tables.
> > > > So the page tables have to be pinned by the existence of the VMA.
> > > > Which means the page tables must be freed by the same RCU callback that
> > > > frees the VMA.  But doing that means that a task which calls mmap();
> > > > munmap(); mmap(); must avoid allocating the same address for the second
> > > > mmap (until the RCU grace period has elapsed), otherwise threads on
> > > > other CPUs may see the stale PTEs instead of the new ones.
> > > >
> > > > Solution 1: Move the page table freeing into the RCU callback, call
> > > > synchronize_rcu() in munmap().
> > > >
> > > > Solution 2: Refcount the VMA and free the page tables on refcount
> > > > dropping to zero.  This doesn't actually work because the stale PTE
> > > > problem still exists.
> > > >
> > > > Solution 3: When unmapping a VMA, instead of erasing the VMA from the
> > > > maple tree, put a "dead" entry in its place.  Once the RCU freeing and the
> > > > TLB shootdown has happened, erase the entry and it can then be allocated.
> > > > If we do that MAP_FIXED will have to synchronize_rcu() if it overlaps
> > > > a dead entry.
> > >
> > > Solution 4: RCU free the page table pages and teach pagewalk.c to
> > > be RCU-safe.  That means that it will have to use rcu_dereference()
> > > or READ_ONCE to dereference (eg) pmdp, but also allows GUP-fast to run
> > > under the rcu read lock instead of disabling interrupts.
> >
> > I might be wrong but my understanding is that the RCU window could not be
> > closed on a CPU where IRQs are disabled. So in a first step GUP-fast might
> > continue to disable interrupts to get safe walking the page directories.
>
> Yes, this is right. PPC already uses RCU for the TLB flush and the
> GUP-fast trick is safe against that.
>
> The comments for PPC say the downside of RCU is having to do an
> allocation in paths that really don't want to fail on memory
> exhaustion
>
> The pagewalk.c needs to call its ops in a sleepable context, otherwise
> it could just use the normal page table locks.. Not sure RCU could be
> fit into here?

Depends on the caller of walk_page_*() whether the ops need to sleep
or not.

The specific problem we're trying to solve here is avoiding taking the
mmap_sem in /proc/$pid/smaps.  Now, we could just disable interrupts
instead of taking the mmap_sem, but I was hoping to do better.  So let's
call that Solution 5:

 - smaps disables interrupts while calling pagewalk.
 - pagewalk accepts that it can be called locklessly (uses
   ptep_get_lockless() and so on)
 - smaps figures out how to handle races with khugepaged
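
To make the second bullet a little more concrete, here is a rough
userspace model (not kernel code) of the re-read pattern a lockless
walker needs when a PTE is wider than one machine word. It is only a
sketch under stated assumptions: every name in it (struct model_pte,
model_pte_set(), model_pte_get_lockless(), model_count_present(),
MODEL_PTE_PRESENT) is hypothetical, C11 acquire/release atomics stand in
for the kernel's barriers, and it does not model interrupt disabling,
page table freeing or khugepaged at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: bit 0 of the low half plays the "present" bit. */
#define MODEL_PTE_PRESENT 0x1u

/* Hypothetical model of a 64-bit PTE stored as two 32-bit halves. */
struct model_pte {
	_Atomic uint32_t low;
	_Atomic uint32_t high;
};

static void model_pte_init(struct model_pte *p)
{
	assert(p != NULL);
	atomic_init(&p->low, 0);
	atomic_init(&p->high, 0);
}

/*
 * Writer side (assumed to be serialised by a lock in the real thing):
 * clear the low half first, then publish high, then the new low half.
 */
static void model_pte_set(struct model_pte *p, uint64_t val)
{
	atomic_store_explicit(&p->low, 0, memory_order_release);
	atomic_store_explicit(&p->high, (uint32_t)(val >> 32),
			      memory_order_release);
	atomic_store_explicit(&p->low, (uint32_t)val, memory_order_release);
}

/*
 * Reader side: read low, then high, then re-check low.  If low changed
 * underneath us, the two halves may belong to different values, so retry.
 */
static uint64_t model_pte_get_lockless(struct model_pte *p)
{
	uint32_t low, high;

	do {
		low = atomic_load_explicit(&p->low, memory_order_acquire);
		high = atomic_load_explicit(&p->high, memory_order_acquire);
	} while (low != atomic_load_explicit(&p->low, memory_order_acquire));

	return ((uint64_t)high << 32) | low;
}

/* smaps-like walk: count present entries without taking any lock. */
static size_t model_count_present(struct model_pte *table, size_t n)
{
	size_t present = 0;

	for (size_t i = 0; i < n; i++) {
		if (model_pte_get_lockless(&table[i]) & MODEL_PTE_PRESENT)
			present++;
	}
	return present;
}
```

A caller would model_pte_init() each entry, update entries with
model_pte_set() from the (locked) writer, and call model_count_present()
from a reader that holds no lock. The retry only guards against a torn
read of a single entry; the count can still be stale by the time it is
returned, which is the kind of race smaps would have to tolerate.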