> On Jan 4, 2021, at 1:01 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>
> On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote:
>>> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>>>
>>> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
>>>>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>>>>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>>>>>
>>>>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
>>>>>>>
>>>>>>> cpu0                            cpu1                    cpu2
>>>>>>> ----                            ----                    ----
>>>>>>>                                                         [ Writable PTE
>>>>>>>                                                           cached in TLB ]
>>>>>>> userfaultfd_writeprotect()
>>>>>>> [ write-*unprotect* ]
>>>>>>> mwriteprotect_range()
>>>>>>> mmap_read_lock()
>>>>>>> change_protection()
>>>>>>>
>>>>>>> change_protection_range()
>>>>>>> ...
>>>>>>> change_pte_range()
>>>>>>> [ *clear* “write”-bit ]
>>>>>>> [ defer TLB flushes ]
>>>>>>>                                 [ page-fault ]
>>>>>>>                                 ...
>>>>>>>                                 wp_page_copy()
>>>>>>>                                  cow_user_page()
>>>>>>>                                   [ copy page ]
>>>>>>>                                                         [ write to old
>>>>>>>                                                           page ]
>>>>>>>                                 ...
>>>>>>>                                  set_pte_at_notify()
>>>>>>
>>>>>> Yuck!
>>>>>
>>>>> Note, the above was posted before we figured out the details so it
>>>>> wasn't showing the real deferred tlb flush that caused problems (the
>>>>> one showed on the left causes zero issues).
>>>>
>>>> Actually it was posted after (note that this is v2). The aforementioned
>>>> scenario that Peter regards to is the one that I actually encountered (not
>>>> the second scenario that is “theoretical”). This scenario that Peter regards
>>>> is indeed more “stupid” in the sense that we should just not write-protect
>>>> the PTE on userfaultfd write-unprotect.
>>>>
>>>> Let me know if I made any mistake in the description.
>>>
>>> I didn't say there is a mistake. I said it is not showing the real
>>> deferred tlb flush that cause problems.
>>>
>>> The issue here is that we have a "defer tlb flush" that runs after
>>> "write to old page".
>>>
>>> If you look at the above, you're induced to think the "defer tlb
>>> flush" that causes issues is the one in cpu0. It's not. That is
>>> totally harmless.
>>
>> I do not understand what you say. The deferred TLB flush on cpu0 *is*
>> the one that causes the problem. The PTE is write-protected (although it is
>> a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
>> the page-fault (and copy), while cpu2 keeps writing to the source page. If
>> cpu0 did not defer the TLB flush, this problem would not happen.
>
> Your argument "If cpu0 did not defer the TLB flush, this problem would
> not happen" is identical to "if the cpu0 has a small TLB size and the
> tlb entry is recycled, the problem would not happen".
>
> There are a multitude of factors that are unrelated to the real
> problematic deferred tlb flush that may happen and still won't cause
> the issue, including a parallel IPI.
>
> The point is that we don't need to worry about the "defer TLB flushes"
> of the un-wrprotect, because you said earlier that deferring tlb
> flushes when you're doing "permission promotions" does not cause
> problems.
>
> The only "deferred tlb flush" we need to worry about, not in the
> picture, is the one following the actual permission removal (the
> wrprotection).

I think you are missing the point of this scenario, which is different from
the second scenario. In this scenario, change_pte_range(), when called to do
userfaultfd’s *unprotect* operation, did not preserve the write-bit if it was
already set. Instead change_pte_range() *cleared* the write-bit. So upon a
logical permission promotion operation - userfaultfd *unprotect* - you got a
physical permission demotion, turning RW PTEs into RO.

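To make the demotion concrete, here is a rough userspace toy model in C. It is not kernel code: struct toy_pte, toy_change_pte(), toy_unprotect_demotes() and the "patched" flag are hypothetical names used only for illustration, and the model assumes that the new protection carries no write permission, so the write bit survives only when preserve_write is true. That is a simplification of what change_pte_range() really does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy PTE: only the two bits this scenario cares about. */
struct toy_pte {
	bool write;	/* hardware write permission */
	bool uffd_wp;	/* userfaultfd write-protect marker */
};

/*
 * Toy stand-in for the present-PTE path of change_pte_range().
 * Assumption: the new protection carries no write permission, so the
 * write bit survives only when preserve_write is true.
 * 'patched' selects the old or the new preserve_write predicate.
 */
struct toy_pte toy_change_pte(struct toy_pte old, bool prot_numa,
			      bool uffd_wp_resolve, bool patched)
{
	bool preserve_write = patched ?
		(prot_numa || uffd_wp_resolve) && old.write :
		prot_numa && old.write;
	struct toy_pte new_pte = old;

	new_pte.write = preserve_write;
	if (uffd_wp_resolve)
		new_pte.uffd_wp = false;	/* unprotect clears the marker */
	return new_pte;
}

/* Returns true if an unprotect request turned a writable PTE read-only. */
bool toy_unprotect_demotes(bool patched)
{
	struct toy_pte rw = { .write = true, .uffd_wp = false };
	struct toy_pte out = toy_change_pte(rw, false, true, patched);

	return rw.write && !out.write;
}
```

Under these assumptions, toy_unprotect_demotes(false) reports the RW to RO demotion for the old predicate, while toy_unprotect_demotes(true) does not, since the write bit is then kept whenever it was already set.
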
This problem is fully resolved by this part of the patch:

--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);

You can argue that this is not directly related to the deferred TLB flush, as
once this chunk is added, a TLB flush would not be needed at all for
userfaultfd-unprotect. But I consider it a part of the problem, especially
since this is what triggered the userfaultfd self-tests to fail.

>> it shows the write that triggers the corruption instead of discussing
>> “windows”, which might be less clear. Running copy_user_page() with stale
>
> I think showing exactly where the race window opens is key to
> understand the issue, but then that's the way I work and feel free to
> think it in any other way that may sound simpler.
>
> I just worried people thinks the deferred tlb flush in your v2 trace
> is the one that causes problem when obviously it's not since it
> follows a permission promotion. Once that is clear, feel free to
> reject my trace.
>
> All I care about is that performance don't regress from CPU-speed to
> disk I/O spindle speed, for soft dirty and uffd-wp.

I would feel more comfortable if you provide patches for uffd-wp. If you
want, I will do it, but I restate that I do not feel comfortable with this
solution (worried as it seems a bit ad-hoc and might leave out a scenario
we all missed or cause a TLB shootdown storm).

As for soft-dirty, I thought that you said that you do not see a better
(backportable) solution for soft-dirty. Correct me if I am wrong.

Anyhow, I will add your comments regarding the stale TLB window to make the
description clearer.