Re: [PATCH v1 mmotm] mm/mprotect: try avoiding write faults for exclusive anonynmous pages when changing protection

Nadav Amit <nadav.amit@xxxxxxxxx> · Fri, 1 Apr 2022 12:15:13 -0700

[ +Rick ]

> On Apr 1, 2022, at 3:13 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:
> 
> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
> can try mapping anonymous pages writable if they are exclusive,
> the PTE is already dirty, and no special handling applies. Mapping the
> PTE writable is essentially the same thing the write fault handler would do
> in this case.

In general I am all supportive for such a change.

I do have some mostly-minor concerns.

> 
> +static inline bool can_change_pte_writable(struct vm_area_struct *vma,
> +					   unsigned long addr, pte_t pte,
> +					   unsigned long cp_flags)
> +{
> +	struct page *page;
> +
> +	if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
> +		/*
> +		 * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
> +		 * without MM_CP_DIRTY_ACCT, there is nothing to do.
> +		 */
> +		return false;
> +
> +	if (!(vma->vm_flags & VM_WRITE))
> +		return false;
> +
> +	if (pte_write(pte) || pte_protnone(pte) || !pte_dirty(pte))
> +		return false;

If pte_write() is already try then return false? I understand you want
to do so because the page is already writable, but it is confusing.

In addition, I am not sure about the pte_dirty() check is really robust.
I mean I think it is ok, but is there any issue with shadow-stack? 

And this also assumes the kernel does not clear the dirty bit without
clearing the present, as otherwise the note in Intel SDM section 4.8
("Accessed and Dirty Flags”) will be relevant and dirty bit might be
set unnecessarily. I think it is ok.

> +
> +	/* Do we need write faults for softdirty tracking? */
> +	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !pte_soft_dirty(pte) &&
> +	    (vma->vm_flags & VM_SOFTDIRTY))

If !IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) then VM_SOFTDIRTY == 0. So I do not
think the IS_ENABLED() is necessary (unless you think it is clearer this
way).

> +		return false;
> +
> +	/* Do we need write faults for uffd-wp tracking? */
> +	if (userfaultfd_pte_wp(vma, pte))
> +		return false;
> +
> +	if (!(vma->vm_flags & VM_SHARED)) {
> +		/*
> +		 * We can only special-case on exclusive anonymous pages,
> +		 * because we know that our write-fault handler similarly would
> +		 * map them writable without any additional checks while holding
> +		 * the PT lock.
> +		 */
> +		page = vm_normal_page(vma, addr, pte);

I guess we cannot call vm_normal_page() twice, once for prot_numa and once
here, in practice...

> +		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
> +			return false;
> +	}
> +
> +	return true;
> +}

Note that there is a small downside to all of that. Assume you mprotect()
a single page from RO to RW and you have many threads.

With my pending patch you would avoid the TLB shootdown (and get a PF).
With this patch you would get a TLB shootdown and save the PF. IOW, I
think it is worthy to skip the shootdown as well in such a case and
instead flush the TLB on spurious page-faults. But I guess that’s for
another patch.