On Fri, 28 Jan 2022 05:09:31 -0800 Michel Lespinasse <michel@xxxxxxxxxxxxxx> wrote:

> Patchset summary:
>
> Classical page fault processing takes the mmap read lock in order to
> prevent races with mmap writers. In contrast, speculative fault
> processing does not take the mmap read lock, and instead verifies,
> when the results of the page fault are about to get committed and
> become visible to other threads, that no mmap writers have been
> running concurrently with the page fault. If the check fails,
> speculative updates do not get committed and the fault is retried
> in the usual, non-speculative way (with the mmap read lock held).
>
> The concurrency check is implemented using a per-mm mmap sequence
> count. The counter is incremented at the beginning and end of each
> mmap write operation. If the counter is initially observed to have an
> even value, and has the same value later on, the observer can deduce
> that no mmap writers have been running concurrently with it between
> those two times. This is similar to a seqlock, except that readers
> never spin on the counter value (they instead revert to taking the
> mmap read lock), and writers are allowed to sleep. One benefit of
> this approach is that it requires no writer-side changes, just some
> hooks in the mmap write lock APIs that writers already use.
>
> The first step of a speculative page fault is to look up the vma and
> read its contents (currently by making a copy of the vma, though in
> principle it would be sufficient to only read the vma attributes that
> are used in page faults). The mmap sequence count is used to verify
> that there were no mmap writers concurrent with the lookup and copy
> steps. Note that walking rbtrees while there may potentially be
> concurrent writers is not an entirely new idea in Linux, as latched
> rbtrees are already doing this. This is safe as long as the lookup is
> followed by a sequence check to verify that concurrency did not
> actually occur (and abort the speculative fault if it did).

I'm surprised that descending the rbtree locklessly doesn't flat-out
oops the kernel. How are we assured that every pointer which is
encountered actually points at the right thing? And what protects us
against things which tear that tree down?

> The next step is to walk down the existing page table tree to find
> the current pte entry. This is done with interrupts disabled to avoid
> races with munmap().

Sebastian, could you please comment on this from the CONFIG_PREEMPT_RT
point of view?

> Again, this is not an entirely new idea, as it repeats a pattern
> already present in fast GUP. Similar precautions are also taken when
> taking the page table lock.
>
> Breaking COW on an existing mapping may require firing MMU notifiers.
> Some care is required to avoid racing with the registration of new
> notifiers. This patchset adds a new per-cpu rwsem to handle this
> situation.
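
For concreteness, the reader side of the counting scheme described
above might look roughly like the sketch below. The mm->mmap_seq field
and both helpers are hypothetical names for illustration, not the
patchset's actual API; the only assumptions are the ones stated in the
summary (writers bump the counter at the start and end of each mmap
write operation, so an odd value means a writer is active):

	/*
	 * Sketch only. Assumes a hypothetical unsigned long mmap_seq
	 * field in struct mm_struct, incremented at the start and end
	 * of every mmap write operation.
	 */
	static inline bool mmap_seq_read_begin(struct mm_struct *mm,
					       unsigned long *seq)
	{
		*seq = smp_load_acquire(&mm->mmap_seq);
		return !(*seq & 1);	/* odd => a writer is active now */
	}

	static inline bool mmap_seq_read_validate(struct mm_struct *mm,
						  unsigned long seq)
	{
		smp_rmb();	/* order speculative reads before the re-check */
		return READ_ONCE(mm->mmap_seq) == seq;
	}

A speculative fault would call mmap_seq_read_begin() once up front and
mmap_seq_read_validate() just before committing its results; on either
failure it falls back to the usual mmap-read-lock path rather than
spinning, which is the difference from read_seqbegin()/read_seqretry().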
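
On the lockless descent question: the usual safety argument for this
kind of walk (my reading of the general technique, not a claim about
what this particular patchset does) is that nodes are freed in a
type-stable way (RCU deferral or SLAB_TYPESAFE_BY_RCU), so a stale
pointer still points at memory of the right type and the walk cannot
oops; it can only compute a wrong answer, which the sequence re-check
then throws away. Using the hypothetical helpers from the previous
sketch, plus a made-up find_vma_lockless():

	struct vm_area_struct *vma, vma_copy;
	unsigned long seq;

	if (!mmap_seq_read_begin(mm, &seq))
		goto fallback;	/* writer active: take the mmap read lock */

	rcu_read_lock();	/* keep freed nodes type-safe to dereference */
	vma = find_vma_lockless(mm, address);	/* hypothetical lookup */
	if (vma)
		vma_copy = *vma;	/* snapshot the attributes the fault needs */
	rcu_read_unlock();

	if (!vma || !mmap_seq_read_validate(mm, seq)) {
		/* the tree may have been torn down under us: discard */
		goto fallback;
	}

	/* vma_copy is now known to be a consistent snapshot */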
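
The interrupts-off page table walk is the same trick fast GUP uses in
mm/gup.c: on architectures that free page tables only after an
IPI-based TLB flush (or its RCU-based equivalent under
CONFIG_MMU_GATHER_RCU_TABLE_FREE), disabling local interrupts keeps
the tables from being freed while this CPU walks them. A sketch, with
walk_to_pte_speculative() standing in for whatever walker the patchset
actually uses:

	unsigned long flags;
	pte_t *ptep;
	pte_t ptent;

	local_irq_save(flags);
	/*
	 * While interrupts are off, a concurrent munmap() cannot
	 * complete the flush that precedes page table freeing, so
	 * the tables reached below stay valid.
	 */
	ptep = walk_to_pte_speculative(mm, address);	/* hypothetical */
	if (ptep)
		ptent = ptep_get_lockless(ptep);	/* tear-free pte read */
	local_irq_restore(flags);

The CONFIG_PREEMPT_RT question above presumably concerns exactly this
reliance on the irqs-off region, whose guarantees differ on RT.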
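
The per-cpu rwsem mentioned last maps onto the existing
percpu_rw_semaphore API: the fault path takes the read side, which is
cheap (a per-cpu counter, no shared-cacheline contention in the fast
path), while the rare notifier-registration path takes the write side
and waits out all readers. The primitives below are the real
percpu-rwsem API, but the semaphore placement, the function names, and
the choice of a trylock on the fault path are my assumptions:

	#include <linux/mmu_notifier.h>
	#include <linux/percpu-rwsem.h>

	DEFINE_STATIC_PERCPU_RWSEM(mmu_notifier_rwsem);	/* placement assumed */

	/* Fault path: hot, so it only trylocks and never sleeps */
	bool speculative_cow_prepare(void)
	{
		return percpu_down_read_trylock(&mmu_notifier_rwsem);
	}

	void speculative_cow_finish(void)
	{
		percpu_up_read(&mmu_notifier_rwsem);
	}

	/* Registration path: rare, may sleep while draining readers */
	void speculative_notifier_register(struct mmu_notifier *mn)
	{
		percpu_down_write(&mmu_notifier_rwsem);
		/* ... hook the new notifier into the mm's list ... */
		percpu_up_write(&mmu_notifier_rwsem);
	}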