On Jun 21, 2022, at 11:10 AM, Peter Xu <peterx@xxxxxxxxxx> wrote: > On Tue, Jun 21, 2022 at 10:14:00AM -0700, Nadav Amit wrote: >> On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@xxxxxxxxxx> wrote: >> >>> Hi, Nadav, >>> >>> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote: >>>> From: Nadav Amit <namit@xxxxxxxxxx> >>>> >>>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in >>>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates. >>>> However, setting read-only PTEs as dirty can have several undesired >>>> implications. >>>> >>>> First, setting read-only PTEs as dirty, can cause these PTEs to become >>>> writable during mprotect() syscall. See in change_pte_range(): >>>> >>>> /* Avoid taking write faults for known dirty pages */ >>>> if (dirty_accountable && pte_dirty(ptent) && >>>> (pte_soft_dirty(ptent) || >>>> !(vma->vm_flags & VM_SOFTDIRTY))) { >>>> ptent = pte_mkwrite(ptent); >>>> } >>> >>> IMHO this is not really the direct reason to add the feature to "allow user >>> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because >>> IIUC what's missing is the pte_uffd_wp() check that I should have added in >>> the shmem+hugetlb uffd-wp series in change_pte_range() but I missed.. >>> >>> But since this is fixed by David's patch to optimize mprotect() altogether >>> which checks pte_uffd_wp() (and afaict that's only needed after the >>> shmem+hugetlb patchset, not before), so I think we're safe now with all the >>> branches. >>> >>> So IMHO we don't need to mention this as it's kind of misleading. It'll be >>> welcomed if you want to recover the pte_dirty behavior in >>> mfill_atomic_install_pte() but probably this is not the right patch for it? >> >> Sorry, I will remove it from the commit log. >> >> I am not sure whether there should be an additional patch to revert (or >> partially revert) 9ae0f87d009ca and to be backported. What do you say? > > I think we discussed it offlist, I'd not bother but I don't have a strong > opinion. Please feel free to go with your preference. > > Just to mention that it might not be a clean revert since at that time we > don't have continue and uffd-wp on files, so we may need to add the > complete check back if we're going to make it. Please also do proper swap > tests for both anon and shmem with things like memcg, and when you posted > the patches I can do some tests too. That’s what I was afraid of. It moves it down in my priority list... >>>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching. >>>> See try_to_unmap_one(): >>>> >>>> /* >>>> * Page is dirty. Flush the TLB if a writable entry >>>> * potentially exists to avoid CPU writes after IO >>>> * starts and then write it out here. >>>> */ >>>> try_to_unmap_flush_dirty(); >>>> >>>> Similarly batching TLB flushed might be prevented in zap_pte_range(): >>>> >>>> if (!PageAnon(page)) { >>>> if (pte_dirty(ptent)) { >>>> force_flush = 1; >>>> set_page_dirty(page); >>>> } >>>> ... >>> >>> I still keep the pure question here (which I asked in the private reply to >>> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty() >>> here. I'll rewrite what I have in the private email to you here again that >>> I think should be the write thing to do here.. >>> >>> if (!PageAnon(page)) { >>> if (pte_dirty(ptent)) { >>> set_page_dirty(page); >>> if (pte_write(ptent)) >>> force_flush = 1; >>> } >>> } >> >> The “Second” part regards a perf issue, not a correctness issue. As I told >> you offline, I think if makes sense to look both on pte_dirty() and >> pte_write(), but I am afraid of other architectures than x86, which I know. > > I don't really see why this logic is arch-specific. I meant, as long as > pte_write() returns a value that means "this page is writable", I still > think it's making sense. > >> After our discussion, I looked at other architectures, and it does look >> safe. So I will add an additional patch for that (with your suggested-by). > > Only if you agree with what I thought, thanks. And that could be a > separate patch for sure out of this one even if to come. I do agree. I did not quite understand your second sentence. I guess you meant not to combine it into this patchset (or at least that’s what I thought and I attribute it to you). >> >> >> I think that perhaps I did not communicate well enough the reason it makes >> sense to set the dirty-bit. >> >> Setting the access-bit and dirty-bit takes quite some time. So regardless of >> whether you set the PageDirty, if you didn’t see access-bit and dirty-bit >> and later you access the page - you are going to pay ~550 cycles for >> nothing. >> >> For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2 >> >> If you want me to allow userspace to provide a hint, let me know. > > You did communicate well, so probably it's me that didn't. :) > > Yes I think it'll be nicer to allow userspace provide the same hint for > UFFDIO_CONTINUE, the reason as I explained in the other thread on the young > bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my > current qemu impl [1]) proactively be used to install pgtables even if > they're code pages and they may not be accessed in the near future. So > IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, from > this point of view. > > There's just still some differences on young/dirty here so I'm not sure > whether the idea applies to both, but I think at least it applies to young > bit (ACCESS_LIKELY). > > [1] https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51d9582 We have a saying in Hebrew: “When you explain slowly, I understand quickly”. Thanks for explaining. I will add hints for UFFDIO_CONTINUE as well.