On Mon, Jan 31, 2022 at 05:14:34PM -0800, Andrew Morton wrote:
> On Fri, 28 Jan 2022 05:09:31 -0800 Michel Lespinasse <michel@xxxxxxxxxxxxxx> wrote:
> > The first step of a speculative page fault is to look up the vma and
> > read its contents (currently by making a copy of the vma, though in
> > principle it would be sufficient to only read the vma attributes that
> > are used in page faults). The mmap sequence count is used to verify
> > that there were no mmap writers concurrent to the lookup and copy steps.
> > Note that walking rbtrees while there may potentially be concurrent
> > writers is not an entirely new idea in linux, as latched rbtrees
> > are already doing this. This is safe as long as the lookup is
> > followed by a sequence check to verify that concurrency did not
> > actually occur (and abort the speculative fault if it did).
>
> I'm surprised that descending the rbtree locklessly doesn't flat-out
> oops the kernel. How are we assured that every pointer which is
> encountered actually points at the right thing? Against things
> which tear that tree down?

It doesn't necessarily point at the _right_ thing. You may get
entirely the wrong node in the tree if you race with a modification,
but, as Michel says, you check the seqcount before you even look at
the VMA (and if the seqcount indicates a modification, you throw away
the result and fall back to the locked version). The rbtree always
points to other rbtree nodes, so you aren't going to walk into some
completely wrong data structure.

> > The next step is to walk down the existing page table tree to find the
> > current pte entry. This is done with interrupts disabled to avoid
> > races with munmap().
>
> Sebastian, could you please comment on this from the CONFIG_PREEMPT_RT
> point of view?

I am not a fan of this approach. For other reasons, I think we want to
switch to RCU-freed page tables, and then we can walk the page tables
with the RCU lock held. Some architectures already RCU-free the page
tables, so I think it's just a matter of converting the rest.
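
To make the seqcount argument above concrete, here is a rough userspace
sketch of the pattern: walk the tree without the lock, copy the node, then
recheck the sequence count and fall back to the locked lookup if it moved.
This is not the code from the patch series. Every name in it (range_tree,
range_node, lookup_speculative and so on) is made up for illustration, the
tree is a plain unbalanced binary tree rather than an rbtree, the lock is a
toy spinlock standing in for the mmap lock, and it assumes nodes are never
freed while a reader may still be walking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: covers [start, end). */
struct range_node {
    _Atomic unsigned long start;
    _Atomic unsigned long end;
    struct range_node *_Atomic left;
    struct range_node *_Atomic right;
};

struct range_tree {
    atomic_uint seq;                 /* even: stable, odd: writer active */
    atomic_flag lock;                /* toy spinlock, stand-in for the mmap lock */
    struct range_node *_Atomic root;
};

struct range_copy { unsigned long start, end; };

static void range_tree_init(struct range_tree *t)
{
    atomic_init(&t->seq, 0);
    atomic_flag_clear(&t->lock);
    atomic_init(&t->root, NULL);
}

static void tree_lock(struct range_tree *t)
{
    while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
        ;                            /* spin */
}

static void tree_unlock(struct range_tree *t)
{
    atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

static void write_begin(struct range_tree *t)
{
    tree_lock(t);
    atomic_fetch_add_explicit(&t->seq, 1, memory_order_relaxed); /* now odd */
    atomic_thread_fence(memory_order_release);
}

static void write_end(struct range_tree *t)
{
    atomic_fetch_add_explicit(&t->seq, 1, memory_order_release); /* even again */
    tree_unlock(t);
}

/* May return the wrong node when racing with a writer, but only ever a node. */
static struct range_node *walk(struct range_tree *t, unsigned long addr)
{
    struct range_node *n = atomic_load_explicit(&t->root, memory_order_acquire);

    while (n) {
        if (addr < atomic_load_explicit(&n->start, memory_order_relaxed))
            n = atomic_load_explicit(&n->left, memory_order_acquire);
        else if (addr >= atomic_load_explicit(&n->end, memory_order_relaxed))
            n = atomic_load_explicit(&n->right, memory_order_acquire);
        else
            break;
    }
    return n;
}

static void snapshot(struct range_node *n, struct range_copy *out)
{
    out->start = atomic_load_explicit(&n->start, memory_order_relaxed);
    out->end = atomic_load_explicit(&n->end, memory_order_relaxed);
}

/* Returns 1 if found, 0 if not found, -1 if the result must be thrown away. */
static int lookup_speculative(struct range_tree *t, unsigned long addr,
                              struct range_copy *out)
{
    unsigned int seq = atomic_load_explicit(&t->seq, memory_order_acquire);
    struct range_node *n;

    if (seq & 1)
        return -1;                   /* writer in progress */
    n = walk(t, addr);
    if (n)
        snapshot(n, out);
    atomic_thread_fence(memory_order_acquire);
    if (atomic_load_explicit(&t->seq, memory_order_relaxed) != seq)
        return -1;                   /* raced with a writer: discard the copy */
    return n != NULL;
}

/* Speculative lookup with fallback to the locked version. */
static bool range_lookup(struct range_tree *t, unsigned long addr,
                         struct range_copy *out)
{
    int ret = lookup_speculative(t, addr, out);

    if (ret < 0) {
        struct range_node *n;

        tree_lock(t);
        n = walk(t, addr);
        if (n)
            snapshot(n, out);
        ret = n != NULL;
        tree_unlock(t);
    }
    return ret > 0;
}

/* Insert a caller-owned node; assumes ranges do not overlap. */
static void range_insert(struct range_tree *t, struct range_node *n,
                         unsigned long start, unsigned long end)
{
    struct range_node *_Atomic *link = &t->root;
    struct range_node *cur;

    atomic_init(&n->start, start);
    atomic_init(&n->end, end);
    atomic_init(&n->left, NULL);
    atomic_init(&n->right, NULL);

    write_begin(t);
    while ((cur = atomic_load_explicit(link, memory_order_relaxed)) != NULL)
        link = start < atomic_load_explicit(&cur->start, memory_order_relaxed)
                   ? &cur->left : &cur->right;
    atomic_store_explicit(link, n, memory_order_release);
    write_end(t);
}
```

The point to notice is that the caller only ever sees the private copy in
struct range_copy, and only after the second sequence check has passed. A
walk that raced with a writer can land on the wrong node, but that result is
discarded before anything acts on it.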