Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@xxxxxxxxxx> wrote:

> Hi, Nadav,
> 
> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@xxxxxxxxxx>
>> 
>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
>> However, setting read-only PTEs as dirty can have several undesired
>> implications.
>> 
>> First, setting read-only PTEs as dirty, can cause these PTEs to become
>> writable during mprotect() syscall. See in change_pte_range():
>> 
>> 	/* Avoid taking write faults for known dirty pages */
>> 	if (dirty_accountable && pte_dirty(ptent) &&
>> 			(pte_soft_dirty(ptent) ||
>> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
>> 		ptent = pte_mkwrite(ptent);
>> 	}
> 
> IMHO this is not really the direct reason to add the feature to "allow user
> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
> IIUC what's missing is the pte_uffd_wp() check that I should have added in
> the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
> 
> But since this is fixed by David's patch to optimize mprotect() altogether
> which checks pte_uffd_wp() (and afaict that's only needed after the
> shmem+hugetlb patchset, not before), so I think we're safe now with all the
> branches.
> 
> So IMHO we don't need to mention this as it's kind of misleading. It'll be
> welcomed if you want to recover the pte_dirty behavior in
> mfill_atomic_install_pte() but probably this is not the right patch for it?

Sorry, I will remove it from the commit log.

I am not sure whether there should be an additional patch to revert (or
partially revert) 9ae0f87d009ca and to be backported. What do you say?

>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
>> See try_to_unmap_one():
>> 
>> 	/*
>> 	 * Page is dirty. Flush the TLB if a writable entry
>> 	 * potentially exists to avoid CPU writes after IO
>> 	 * starts and then write it out here.
>> 	 */
>> 	try_to_unmap_flush_dirty();
>> 
>> Similarly batching TLB flushed might be prevented in zap_pte_range():
>> 
>> 	if (!PageAnon(page)) {
>> 		if (pte_dirty(ptent)) {
>> 			force_flush = 1;
>> 			set_page_dirty(page);
>> 		}
>> 	...
> 
> I still keep the pure question here (which I asked in the private reply to
> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
> here. I'll rewrite what I have in the private email to you here again that
> I think should be the write thing to do here..
> 
> if (!PageAnon(page)) {
> if (pte_dirty(ptent)) {
> set_page_dirty(page);
> if (pte_write(ptent))
> force_flush = 1;
> }
> }

The “Second” part regards a perf issue, not a correctness issue. As I told
you offline, I think if makes sense to look both on pte_dirty() and
pte_write(), but I am afraid of other architectures than x86, which I know.
After our discussion, I looked at other architectures, and it does look
safe. So I will add an additional patch for that (with your suggested-by).

> I also mentioned the other example of doing mprotect(PROT_READ) upon a
> dirty pte can also create a pte with dirty=1 and write=0 which should be
> the same condition we have here with uffd, afaict. So it's at least not a
> solo problem for uffd, and I still think if we treat this tlb flush issue
> as a perf bug we should consider fixing up the tlb flush code instead
> because otherwise the "mprotect(PROT_READ) upon dirty pte" will also be
> able to hit it.
> 
> Meanwhile, I'll have the same comment that this is not helping any reviewer
> to understand why we need to grant user ability to conditionally set dirty
> bit from uABI, so I think it's better to drop this paragraph too for this
> patch alone.

Ok. Point taken.

> 
>> In general, setting a PTE as dirty seems for read-only entries might be
>> dangerous. It should be reminded the dirty-COW vulnerability mitigation
>> also relies on the dirty bit being set only after COW (although it does
>> not appear to apply to userfaultfd).
>> 
>> To summarize, setting the dirty bit for read-only PTEs is dangerous. But
>> even if we only consider writable pages, always setting the dirty bit or
>> always leaving it clear, does not seem as the best policy. Leaving the
>> bit clear introduces overhead on the first write-access to set the bit.
>> Setting the bit for pages the are eventually not written to can require
>> more TLB flushes.
> 
> And IMHO only this paragraph is the real and proper reasoning for this
> patch..

Will do.

> 
>> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>> 	 uffdio_continue.range.start) {
>> 		goto out;
>> 	}
>> -	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
>> +	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
>> 		goto out;
>> 
>> -	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
>> +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
> 
> Setting dirty by default for CONTINUE may make some sense (unlike young),
> at least to keep the old behavior of the code where the pte was
> unconditionally set.
> 
> But there's another thought a real CONTINUE user modifies the page cache
> elsewhere so logically the PageDirty should be set elsewhere already
> anyways, in most cases when the userapp is updating the page cache with
> another mapping before installing pgtables for this mapping. Then this
> dirty is not required, it seems.
> 
> If we're going to export this to uABI, I'm wondering maybe it'll be nicer
> to not apply either young or dirty bit for CONTINUE, because fundamentally
> losing dirty bit doesn't sound risky, meanwhile the user app should have
> the best knowledge of what to do (whether page was requested or it's just
> during a pre-faulting in the background).
> 
> Axel may have some better thoughts.

I think that perhaps I did not communicate well enough the reason it makes
sense to set the dirty-bit.

Setting the access-bit and dirty-bit takes quite some time. So regardless of
whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
and later you access the page - you are going to pay ~550 cycles for
nothing.

For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2

If you want me to allow userspace to provide a hint, let me know.

>> @@ -85,13 +84,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>> 
>> 	if (writable)
>> 		_dst_pte = pte_mkwrite(_dst_pte);
>> -	else
>> +	else {
>> 		/*
>> 		 * We need this to make sure write bit removed; as mk_pte()
>> 		 * could return a pte with write bit set.
>> 		 */
>> 		_dst_pte = pte_wrprotect(_dst_pte);
>> 
>> +		/* Marking RO entries as dirty can mess with other code */
>> +		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
>> +			_dst_pte = pte_mkdirty(_dst_pte);
> 
> Hmm.. what about "writable=true" ones?

Oh boy, I reverted the condition… :(

Thanks for noticing. I’ll fix it.






[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux