> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>
> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
>>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>>>
>>> Hello,
>>>
>>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>>>
>>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
>>>>>
>>>>> cpu0                            cpu1                    cpu2
>>>>> ----                            ----                    ----
>>>>>                                                         [ Writable PTE
>>>>>                                                           cached in TLB ]
>>>>> userfaultfd_writeprotect()
>>>>> [ write-*unprotect* ]
>>>>> mwriteprotect_range()
>>>>> mmap_read_lock()
>>>>> change_protection()
>>>>>
>>>>> change_protection_range()
>>>>> ...
>>>>> change_pte_range()
>>>>> [ *clear* “write”-bit ]
>>>>> [ defer TLB flushes ]
>>>>>                                 [ page-fault ]
>>>>>                                 ...
>>>>>                                 wp_page_copy()
>>>>>                                  cow_user_page()
>>>>>                                   [ copy page ]
>>>>>                                                         [ write to old
>>>>>                                                           page ]
>>>>>                                 ...
>>>>>                                  set_pte_at_notify()
>>>>
>>>> Yuck!
>>>
>>> Note, the above was posted before we figured out the details so it
>>> wasn't showing the real deferred tlb flush that caused problems (the
>>> one showed on the left causes zero issues).
>>
>> Actually it was posted after (note that this is v2). The aforementioned
>> scenario that Peter regards to is the one that I actually encountered (not
>> the second scenario that is “theoretical”). This scenario that Peter regards
>> is indeed more “stupid” in the sense that we should just not write-protect
>> the PTE on userfaultfd write-unprotect.
>>
>> Let me know if I made any mistake in the description.
>
> I didn't say there is a mistake. I said it is not showing the real
> deferred tlb flush that cause problems.
>
> The issue here is that we have a "defer tlb flush" that runs after
> "write to old page".
>
> If you look at the above, you're induced to think the "defer tlb
> flush" that causes issues is the one in cpu0. It's not. That is
> totally harmless.

I do not understand what you say. The deferred TLB flush on cpu0 *is* the
one that causes the problem. The PTE is write-protected (although it is a
userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
the page-fault (and copy), while cpu2 keeps writing to the source page. If
cpu0 did not defer the TLB flush, this problem would not happen.

>>> The problematic one not pictured is the one of the wrprotect that has
>>> to be running in another CPU which is also isn't picture above. More
>>> accurate traces are posted later in the thread.
>>
>> I think I included this scenario as well in the commit log (of v2). Let me
>> know if I screwed up and the description is not clear.
>
> Instead of not showing the real "defer tlb flush" in the trace and
> then fixing it up in the comment, why don't you take the trace showing
> the real problematic "defer tlb flush"? No need to reinvent it.

The scenario you mention is indeed identical to the second scenario I
mention in the commit log. I think the version I included is clearer since
it shows the write that triggers the corruption instead of discussing
“windows”, which might be less clear. Running copy_user_page() with a stale
TLB is by itself not a bug if you detect it later (e.g., using pte_same()).

Note that my second scenario is also consistent in style with the first
scenario. I am not married to my description, and if you (and others)
insist, I will copy-paste your version.

> This show the real deferred tlb flush, your v2 does not include it
> instead.

Are you talking about the first scenario (write-unprotect), the second one
(write-protect followed by write-unprotect), or both? It seems to me that
all the deferred TLB flushes are mentioned at the point at which they are
deferred. I can add the point at which they are performed.