Re: [PATCH v1 mmotm] mm/mprotect: try avoiding write faults for exclusive anonynmous pages when changing protection

David Hildenbrand <david@xxxxxxxxxx> · Mon, 4 Apr 2022 15:04:28 +0200

On 01.04.22 21:15, Nadav Amit wrote:
> [ +Rick ]
> 
>> On Apr 1, 2022, at 3:13 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:
>>
>> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
>> can try mapping anonymous pages writable if they are exclusive,
>> the PTE is already dirty, and no special handling applies. Mapping the
>> PTE writable is essentially the same thing the write fault handler would do
>> in this case.
> 
> In general I am all supportive for such a change.
> 
> I do have some mostly-minor concerns.

Hi Nadav,

thanks a lot for your review!

> 
>>
>> +static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>> +					   unsigned long addr, pte_t pte,
>> +					   unsigned long cp_flags)
>> +{
>> +	struct page *page;
>> +
>> +	if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
>> +		/*
>> +		 * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
>> +		 * without MM_CP_DIRTY_ACCT, there is nothing to do.
>> +		 */
>> +		return false;
>> +
>> +	if (!(vma->vm_flags & VM_WRITE))
>> +		return false;
>> +
>> +	if (pte_write(pte) || pte_protnone(pte) || !pte_dirty(pte))
>> +		return false;
> 
> If pte_write() is already try then return false? I understand you want
> to do so because the page is already writable, but it is confusing.

I thought about just doing outside of the function

if ((vma->vm_flags & VM_WRITE) && !pte_write(pte) &&
    can_change_pte_writable()...

I refrained from doing so because the sequence of checks might be
sub-optimal. But most probably we don't really care about that and it
might make the code easier to grasp.

Would that make it clearer?

> 
> In addition, I am not sure about the pte_dirty() check is really robust.
> I mean I think it is ok, but is there any issue with shadow-stack? 

Judging that it's already used that way for VMAs with dirty tracking, I
assume it's ok. Without checking that the PTE is dirty, we'd have to do a:

pte_mkwrite(pte_mkwrite(ptent));

Which would set the pte and consequently the page dirty, although there
might not even be a write access. That's what we want to avoid here.

> 
> And this also assumes the kernel does not clear the dirty bit without
> clearing the present, as otherwise the note in Intel SDM section 4.8
> ("Accessed and Dirty Flags”) will be relevant and dirty bit might be
> set unnecessarily. I think it is ok.

Yeah, I think so as well.

> 
>> +
>> +	/* Do we need write faults for softdirty tracking? */
>> +	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !pte_soft_dirty(pte) &&
>> +	    (vma->vm_flags & VM_SOFTDIRTY))
> 
> If !IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) then VM_SOFTDIRTY == 0. So I do not
> think the IS_ENABLED() is necessary (unless you think it is clearer this
> way).

Right, we can just do

if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))

and it should get fully optimized out. Thanks!

> 
>> +		return false;
>> +
>> +	/* Do we need write faults for uffd-wp tracking? */
>> +	if (userfaultfd_pte_wp(vma, pte))
>> +		return false;
>> +
>> +	if (!(vma->vm_flags & VM_SHARED)) {
>> +		/*
>> +		 * We can only special-case on exclusive anonymous pages,
>> +		 * because we know that our write-fault handler similarly would
>> +		 * map them writable without any additional checks while holding
>> +		 * the PT lock.
>> +		 */
>> +		page = vm_normal_page(vma, addr, pte);
> 
> I guess we cannot call vm_normal_page() twice, once for prot_numa and once
> here, in practice...

I guess we could, but it doesn't necessarily make the code easier to
read :) And we want to skip protnone either way.

> 
>> +		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>> +			return false;
>> +	}
>> +
>> +	return true;
>> +}
> 
> Note that there is a small downside to all of that. Assume you mprotect()
> a single page from RO to RW and you have many threads.
> 
> With my pending patch you would avoid the TLB shootdown (and get a PF).
> With this patch you would get a TLB shootdown and save the PF. IOW, I
> think it is worthy to skip the shootdown as well in such a case and
> instead flush the TLB on spurious page-faults. But I guess that’s for
> another patch.

Just so I understand correctly: your optimization avoids the flush when
effectively, nothing changed (R/O -> R/O).

And the optimization for this case here would be, to avoid the TLB flush
when similarly not required (R/O -> R/W).

Correct?

-- 
Thanks,

David / dhildenb