On Fri, Apr 30, 2021 at 12:52:01PM -0700, Michel Lespinasse wrote:
> This patchset is my take on speculative page faults (spf).
> It builds on ideas that have been previously proposed by Laurent Dufour,
> Peter Zijlstra and others before. While Laurent's previous proposal
> was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> this now based on what I think is a simpler and more bisectable approach,
> much improved scaling numbers in the anonymous vma case, and the Android
> use case that has since emerged. I will expand on these points towards
> the end of this message.

I want to address a few questions that I think are likely to come up,
about how this patchset relates to others currently being worked on,
and about design points for this patchset (mainly about the per-mm
sequence count).


I- Maple tree

I do not think there is any fundamental conflict between the maple
tree patches currently being considered, and this patchset.
I actually have a (very lightly tested) tree merging the two together,
which was a fairly easy merge. For those interested, I made this
available at my github, as the v5.12-maple-spf branch.

At the same time, Matthew & Liam have made it known that they would
like to build some lockless fault facilities on top of maple tree, and
even though these ideas have not been implemented yet (AFAIK), my
proposal probably falls short of what they have in mind.

From my point of view, I do not see that as a fundamental conflict
either; my take is that I like to use a more incremental approach and
that the speculative page fault ideas are worth exploring on their
own; they could be further extended in the future with some of the
additional ideas I have heard discussed in association with maple tree.

I am aware of two main areas where my proposal is more limited than
the plans I have heard from Matthew & Liam. Maybe there are more, and
I hope they will correct me if that is the case. But, the ones I know
about would be:

1- VMA lookups.
This patchset has mmap writers update a sequence counter around
updates; the speculative fault path uses that counter to detect
concurrent updates when looking up and copying the VMA. This means
lookups might fail if they overlap with a concurrent mmap writer; the
alternative discussed by maple tree proponents would be to make VMAs
immutable and have the writers actually make a new copy when they want
to update. While this might impose some costs on the writers, it would
benefit the fault path in two ways: first, lookups would always
succeed, and second, the fault path wouldn't need to make a VMA copy.
I think this is worth exploring, but can be done as a separate step.

2- Validation at the end of the page fault.
After taking the page table lock but before inserting the new PTE,
this patchset verifies the per-mm sequence counter to validate that no
mmap writers ran concurrently with the fault. As people noted, this is
quite restrictive; page faults may unnecessarily abort due to writers
operating on a separate memory range. This topic is worthy of
discussion independently of the maple tree stuff, so I'll get back to
it further down.

Matthew & Liam, do you have other extensions in mind which I have not
covered here?


II- Range locking

Prior to this patchset I had been working on mmap range locking
approaches, in order to allow non-overlapping memory operations to
proceed concurrently. I think this is still an interesting idea, but
the speculative page fault proposal is independent of it and is more
mature, so I think it should be submitted first.


III- Thoughts about concurrency checks at the end of the page fault

As noted, the check using the per-mm counter can lead to unnecessary
speculative page fault aborts. Why do it that way then?

The first reason I want to give is practical.
The types of faults this patchset implements speculatively tend to be
fairly quick - in particular, no I/O is involved (for the swap case,
we only implement the case of hitting into the swap cache). As a
result, there is not very much time for concurrent mmap writers to
interfere. I did try implementing a more precise check, but it did not
significantly improve the success rate in workloads I looked at, so it
seemed best to go with the simplest possible check first.

But still, could we implement a precise check that never leads to
unnecessary page fault aborts?

The simplest way to go about this would seem to be to look up the VMA
again at the end of the page fault (after taking the page table lock
but before inserting the new PTE into the page table). If the VMA
attributes have not changed, we might be tempted to conclude it is
safe to insert the new PTE and complete the page fault.

However, I am not sure if that would always be correct. The case I am
worried about is when breaking COW:

- Page P is COW mapped into processes A and B
- Thread A1 (within process A) takes a write fault on P
- A1 allocates a new page P2
- A1 starts copying P into P2
- B unmaps P
- Thread A2 (within process A) takes a write fault on P
  P now has only one mapping, so A2 just changes P to be writable
  A2's page fault completes
- A2 writes into P
- A2 calls mprotect() to make P's mapping readonly.
  P's PTE gets its W permission bit cleared.
- A2 calls mprotect() to make P's mapping writable again.
- A1 is done copying P into P2.
  A1 takes the page table lock
  A1 verifies that P's VMA has not changed - it's still a writable mapping
  A1 verifies that P's PTE has not changed - it still points to P with
  the W permission bit cleared.
  A1 updates the pte to point to the P2 page (with the W permission bit set)

The above would be incorrect because A2's write into P may get lost.

This seems like a convoluted scenario but I am not sure how to cleanly
protect against it.
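
To make the COW interleaving described here concrete, below is a
minimal userspace sketch that replays it single-threaded. It is not
kernel code and is not taken from the patchset: every name in it
(model_mm, fault_snapshot, model_mprotect, and so on) is hypothetical,
and it assumes that mprotect() acts as an mmap writer that bumps a
per-mm counter, while a page fault in another thread does not. It
compares a check based only on VMA attributes and the PTE value with a
check based on the counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of process A's address space state for page P. */
struct model_mm {
    unsigned long seq;   /* assumed per-mm counter, bumped by mmap writers */
    bool vma_writable;   /* stands in for the VMA attributes */
    int pte_page;        /* page the PTE points to: 1 = P, 2 = P2 */
    bool pte_writable;   /* the PTE's W permission bit */
};

/* What a speculative fault remembers when it starts. */
struct fault_snapshot {
    unsigned long seq;
    bool vma_writable;
    int pte_page;
    bool pte_writable;
};

/* Model of mprotect(): an mmap writer, so it bumps the counter around
 * its update. Per the scenario, removing write permission clears the
 * PTE's W bit, while restoring it leaves the W bit cleared. */
static void model_mprotect(struct model_mm *mm, bool writable)
{
    mm->seq++;                     /* writer begins */
    mm->vma_writable = writable;
    if (!writable)
        mm->pte_writable = false;
    mm->seq++;                     /* writer ends */
}

static struct fault_snapshot fault_begin(const struct model_mm *mm)
{
    struct fault_snapshot s = {
        mm->seq, mm->vma_writable, mm->pte_page, mm->pte_writable
    };
    return s;
}

/* "Precise" check: only compares VMA attributes and the PTE value. */
static bool precise_check_passes(const struct model_mm *mm,
                                 const struct fault_snapshot *s)
{
    return mm->vma_writable == s->vma_writable &&
           mm->pte_page == s->pte_page &&
           mm->pte_writable == s->pte_writable;
}

/* Coarse check: any mmap writer since the fault began fails it. */
static bool seq_check_passes(const struct model_mm *mm,
                             const struct fault_snapshot *s)
{
    return mm->seq == s->seq;
}

/* Replays the interleaving and reports what each check would say
 * at the point where A1 is about to install the PTE for P2. */
void replay_cow_scenario(bool *precise_ok, bool *seq_ok)
{
    /* P is COW mapped: writable VMA, PTE points to P with W cleared. */
    struct model_mm mm = { 0, true, 1, false };

    /* A1 takes a write fault on P and starts copying P into P2. */
    struct fault_snapshot a1 = fault_begin(&mm);
    assert(seq_check_passes(&mm, &a1));

    /* B unmaps P: a different mm, so A's state is unchanged. */

    /* A2 takes a write fault; P has one mapping left, so the PTE is
     * made writable. This is a fault, not an mmap writer: no bump. */
    mm.pte_writable = true;

    /* A2 writes into P (page contents are not modeled here). */

    model_mprotect(&mm, false);    /* A2: make the mapping readonly */
    model_mprotect(&mm, true);     /* A2: make it writable again */

    /* A1 is done copying and validates before installing P2. */
    *precise_ok = precise_check_passes(&mm, &a1);
    *seq_ok = seq_check_passes(&mm, &a1);
}
```

In this model the VMA and PTE comparison passes even though A2 wrote
into P in the meantime, whereas the counter comparison fails and the
speculative fault would abort. That only holds under the assumptions
stated above; it says nothing about cases the model leaves out.
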
Surely one could extend the validation mechanism (Laurent's proposal
used per-VMA sequence counts), but there is still a possibility of
unnecessary aborts there, so I don't think that is fully satisfactory.

I think re-fetching the VMA at the end of the page fault would be safe
in at least some of the cases though, most notably if the original PTE
was pte_none. So maybe that would cover enough cases?

To sum it up, I agree that using the per-mm sequence count to validate
page faults is imperfect, but I think it gives a decent first stab at
the issue, and that further improvements are not trivial enough to
design in a vacuum - they would be better handled by incrementally
addressing problem workloads IMO.

--
Michel "walken" Lespinasse