On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@xxxxxxxxxx> wrote: > Hi, Nadav, > > On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote: >> From: Nadav Amit <namit@xxxxxxxxxx> >> >> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in >> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates. >> However, setting read-only PTEs as dirty can have several undesired >> implications. >> >> First, setting read-only PTEs as dirty, can cause these PTEs to become >> writable during mprotect() syscall. See in change_pte_range(): >> >> /* Avoid taking write faults for known dirty pages */ >> if (dirty_accountable && pte_dirty(ptent) && >> (pte_soft_dirty(ptent) || >> !(vma->vm_flags & VM_SOFTDIRTY))) { >> ptent = pte_mkwrite(ptent); >> } > > IMHO this is not really the direct reason to add the feature to "allow user > to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because > IIUC what's missing is the pte_uffd_wp() check that I should have added in > the shmem+hugetlb uffd-wp series in change_pte_range() but I missed.. > > But since this is fixed by David's patch to optimize mprotect() altogether > which checks pte_uffd_wp() (and afaict that's only needed after the > shmem+hugetlb patchset, not before), so I think we're safe now with all the > branches. > > So IMHO we don't need to mention this as it's kind of misleading. It'll be > welcomed if you want to recover the pte_dirty behavior in > mfill_atomic_install_pte() but probably this is not the right patch for it? Sorry, I will remove it from the commit log. I am not sure whether there should be an additional patch to revert (or partially revert) 9ae0f87d009ca and to be backported. What do you say? >> Second, unmapping read-only dirty PTEs often prevents TLB flush batching. >> See try_to_unmap_one(): >> >> /* >> * Page is dirty. Flush the TLB if a writable entry >> * potentially exists to avoid CPU writes after IO >> * starts and then write it out here. >> */ >> try_to_unmap_flush_dirty(); >> >> Similarly batching TLB flushed might be prevented in zap_pte_range(): >> >> if (!PageAnon(page)) { >> if (pte_dirty(ptent)) { >> force_flush = 1; >> set_page_dirty(page); >> } >> ... > > I still keep the pure question here (which I asked in the private reply to > you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty() > here. I'll rewrite what I have in the private email to you here again that > I think should be the write thing to do here.. > > if (!PageAnon(page)) { > if (pte_dirty(ptent)) { > set_page_dirty(page); > if (pte_write(ptent)) > force_flush = 1; > } > } The “Second” part regards a perf issue, not a correctness issue. As I told you offline, I think if makes sense to look both on pte_dirty() and pte_write(), but I am afraid of other architectures than x86, which I know. After our discussion, I looked at other architectures, and it does look safe. So I will add an additional patch for that (with your suggested-by). > I also mentioned the other example of doing mprotect(PROT_READ) upon a > dirty pte can also create a pte with dirty=1 and write=0 which should be > the same condition we have here with uffd, afaict. So it's at least not a > solo problem for uffd, and I still think if we treat this tlb flush issue > as a perf bug we should consider fixing up the tlb flush code instead > because otherwise the "mprotect(PROT_READ) upon dirty pte" will also be > able to hit it. > > Meanwhile, I'll have the same comment that this is not helping any reviewer > to understand why we need to grant user ability to conditionally set dirty > bit from uABI, so I think it's better to drop this paragraph too for this > patch alone. Ok. Point taken. > >> In general, setting a PTE as dirty seems for read-only entries might be >> dangerous. It should be reminded the dirty-COW vulnerability mitigation >> also relies on the dirty bit being set only after COW (although it does >> not appear to apply to userfaultfd). >> >> To summarize, setting the dirty bit for read-only PTEs is dangerous. But >> even if we only consider writable pages, always setting the dirty bit or >> always leaving it clear, does not seem as the best policy. Leaving the >> bit clear introduces overhead on the first write-access to set the bit. >> Setting the bit for pages the are eventually not written to can require >> more TLB flushes. > > And IMHO only this paragraph is the real and proper reasoning for this > patch.. Will do. > >> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg) >> uffdio_continue.range.start) { >> goto out; >> } >> - if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE) >> + if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE)) >> goto out; >> >> - uffd_flags = UFFD_FLAGS_ACCESS_LIKELY; >> + uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY; > > Setting dirty by default for CONTINUE may make some sense (unlike young), > at least to keep the old behavior of the code where the pte was > unconditionally set. > > But there's another thought a real CONTINUE user modifies the page cache > elsewhere so logically the PageDirty should be set elsewhere already > anyways, in most cases when the userapp is updating the page cache with > another mapping before installing pgtables for this mapping. Then this > dirty is not required, it seems. > > If we're going to export this to uABI, I'm wondering maybe it'll be nicer > to not apply either young or dirty bit for CONTINUE, because fundamentally > losing dirty bit doesn't sound risky, meanwhile the user app should have > the best knowledge of what to do (whether page was requested or it's just > during a pre-faulting in the background). > > Axel may have some better thoughts. I think that perhaps I did not communicate well enough the reason it makes sense to set the dirty-bit. Setting the access-bit and dirty-bit takes quite some time. So regardless of whether you set the PageDirty, if you didn’t see access-bit and dirty-bit and later you access the page - you are going to pay ~550 cycles for nothing. For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2 If you want me to allow userspace to provide a hint, let me know. >> @@ -85,13 +84,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, >> >> if (writable) >> _dst_pte = pte_mkwrite(_dst_pte); >> - else >> + else { >> /* >> * We need this to make sure write bit removed; as mk_pte() >> * could return a pte with write bit set. >> */ >> _dst_pte = pte_wrprotect(_dst_pte); >> >> + /* Marking RO entries as dirty can mess with other code */ >> + if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY) >> + _dst_pte = pte_mkdirty(_dst_pte); > > Hmm.. what about "writable=true" ones? Oh boy, I reverted the condition… :( Thanks for noticing. I’ll fix it.