> On Jan 5, 2021, at 12:58 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: >> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: >>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >>> >>>> The scenario that happens in selftests/vm/userfaultfd is as follows: >>>> >>>> cpu0 cpu1 cpu2 >>>> ---- ---- ---- >>>> [ Writable PTE >>>> cached in TLB ] >>>> userfaultfd_writeprotect() >>>> [ write-*unprotect* ] >>>> mwriteprotect_range() >>>> mmap_read_lock() >>>> change_protection() >>>> >>>> change_protection_range() >>>> ... >>>> change_pte_range() >>>> [ *clear* “write”-bit ] >>>> [ defer TLB flushes ] >>>> [ page-fault ] >>>> ... >>>> wp_page_copy() >>>> cow_user_page() >>>> [ copy page ] >>>> [ write to old >>>> page ] >>>> ... >>>> set_pte_at_notify() >>> >>> Yuck! >> >> Note, the above was posted before we figured out the details so it >> wasn't showing the real deferred tlb flush that caused problems (the >> one showed on the left causes zero issues). >> >> The problematic one not pictured is the one of the wrprotect that has >> to be running in another CPU which is also isn't picture above. More >> accurate traces are posted later in the thread. > > Lets assume CPU0 does a read-lock, W -> RO with deferred flush. This is the second scenario that is mentioned in the patch. (The first one is relatively easy to address by not clearing the write-bit). >>> Isn't this all rather similar to the problem that resulted in the >>> tlb_flush_pending mess? >>> >>> I still think that's all fundamentally buggered, the much saner solution >>> (IMO) would've been to make things wait for the pending flush, instead >> >> How do intend you wait in PT lock while the writer also has to take PT >> lock repeatedly before it can do wake_up_var? >> >> If you release the PT lock before calling wait_tlb_flush_pending it >> all falls apart again. > > I suppose you can check for pending, if found, release lock, wait for 0, > and re-take the fault? My personal take on this issue (which for full disclosure I think Andrea disagrees with) is that it the most important enhancement is to reduce the number of cases which we mistakenly think that we must wait for pending TLB flush. It will not be free though. As to the enhancement that you propose: although it seems as a valid enhancement to me, I think that it is more robust to make forward progress when possible (as done today). This is especially important if the proposed enhancement cannot be checked by lockdep.