> On Jun 28, 2022, at 12:15 PM, Peter Xu <peterx@xxxxxxxxxx> wrote: > > ⚠ External Email > > On Mon, Jun 27, 2022 at 11:37:32PM +0000, Nadav Amit wrote: >> >> Besides the benefit of avoiding a TLB flush, there is also the benefit >> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be >> preceded by memory write to the shared memory, but that does not have to >> be the case. Similarly, if in the future userfaultfd would also support >> memory-backed private mappings, that does not have to be the case either. > > Could I ask what's the not supported private mapping you're talking about > (and I think you meant "file-backed")? I thought all kinds of private > mappings are supported on all modes already? Yes, I meant file-backed. See vma_can_userfault() for the supported types of memory: static inline bool vma_can_userfault(struct vm_area_struct *vma, unsigned long vm_flags) { if (vm_flags & VM_UFFD_MINOR) return is_vm_hugetlb_page(vma) || vma_is_shmem(vma); #ifndef CONFIG_PTE_MARKER_UFFD_WP /* * If user requested uffd-wp but not enabled pte markers for * uffd-wp, then shmem & hugetlbfs are not supported but only * anonymous. */ if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma)) return false; #endif return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) || vma_is_shmem(vma); } > > Thanks for explaining, so yes either false positive or false negative on > pte dirty bit can bring a side effect on things, and the user knows the > best. Makes sense. I think what I overlooked is the downside of having > both write==dirty==1 when it doesn't need to. > >> >> Putting all of the above aside, there is a bug in my code, but this >> bug also points why dirty should not be set unconditionally. If someone >> uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and >> soft-dirty) might be misleading, causing unnecessary userspace writeback >> of memory. > > Well I never thought anyone would be using soft-dirty with uffd because > logically uffd-wp was somehow trying to replace it with a better interface, > hopefully. I have heard about some who does (not me). So I do not make it up, unfortunately. > > If to talk about it, IMHO the most important thing is really not the dirty > bit but the write bit: even if you leave both the dirty bits cleared (so > the user specified !WRITE_HINT then we grant write bit only), it means when > the write happens for sure dirty bit will be set on demand but not > soft-dirty since we won't generate a page fault at all. AFAICT it'll be a > false negative. > > The old code is safe in that respect on that we always set dirty so we can > only have false positive (which is acceptable in this case) but not false > negative (which is not). I agree that the PTE should be left RO in this case. I was just pointing that the user might not be aware of soft-dirty being used, and then it makes sense to treat the (lack of) write-hint appropriately. In the current code, when PTEs are unconditionally marked as dirty, even when they are write-protected, soft-dirty would produce false-positives. Nothing would break, but it is not good for performance either. > >> >> So I do need to fix my code so it would not write-unprotect memory if >> soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But >> I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY. > > Sorry to say that, but to me it seems a proof that dirty bit is even more > tricky than access bit so we need to have better thoughts, even if the > WRITE_LIKELY hint sounds straightforward. I now can buy in all the tlb > flush points you explained, but still IMHO that's too fine granule (some > random ptes in a batched tlb range flush may not even matter!) and I'm just > naively wondering whether we need more thoughts for an uABI to be proposed > like that. I won't ask for some use cases and especially numbers to prove > assuming I'm asking too much, but really that'll be very persuasive if we > can get. I'd not worry for ACCESS_LIKELY on it because that's much more > straightforward to me. Let me get back to you with numbers. > >> >>> >>> Maybe with the to be proposed RFC patch for tlb flush we can know whether >>> that should be something we can rely on. It'll add more dependency on this >>> work which I'm sorry to say. It's just that IMHO we should think carefully >>> for the write-hint because this is a solid new uABI we're talking about. >>> >>> The other option is we can introduce the access hint first and think more >>> on the dirty one (we can always add it when proper). What do you think? >>> Also, David please chim in anytime if I missed the whole point when you >>> proposed the idea. >>> >>>> >>>> But this discussion made me think that there are two somewhat related >>>> matters that we may want to address: >>>> >>>> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp >>>> to proactively make entries writable and save . >>> >>> I'm not sure I'm right here, but I think David's patch should have covered >>> that case? The new helper only checks pte_uffd_wp() based on my memory, >>> and when resolving page faults uffd-wp bit should have been gone, so it >>> should be treated the same as normal ptes. >> >> Let’s see we get to the same page: >> >> mwriteprotect_range() does: >> >> change_protection(&tlb, dst_vma, start, start + len, newprot, >> enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE) >> >> As you see no use of MM_CP_TRY_CHANGE_WRITABLE. >> >> And then change_pte_range() does: >> >> if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && >> !pte_write(ptent) && >> can_change_pte_writable(vma, addr, ptent)) >> ptent = pte_mkwrite(ptent); >> >> If we do not provide MM_CP_TRY_CHANGE_WRITABLE, the PTE would not be >> writable. >> >> Now for the record, we may want an additional optimization, so if a >> fault handler is woken because a PTE become writable, we do not retry >> the page fault (since the PTE is already writable). It is a small change, >> but let’s see first we agree on the patches so far… > > Ah I see what I missed, thanks. > > Another option is we call can_change_pte_writable() somehow in the > unprotect branch. I don’t think that I got this one. tl;dr: I’ll run some numbers (access when PTE-dirty/clear), address the aforementioned issues and send v2. Based on the results we can decide whether it makes sense.