Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

Nadav Amit <nadav.amit@xxxxxxxxx> · Tue, 12 Jan 2021 12:38:34 -0800

> On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> 
> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>> On Jan 12, 2021, at 11:02 AM, Laurent Dufour <ldufour@xxxxxxxxxxxxxxxxxx> wrote:
>>> 
>>> Le 12/01/2021 à 17:57, Peter Zijlstra a écrit :
>>>> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
>>>>> Le 12/01/2021 à 12:43, Vinayak Menon a écrit :
>>>>>> Possibility of race against other PTE modifiers
>>>>>> 
>>>>>> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
>>>>>> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
>>>> Right, that's exactly the kind of thing I was worried about.
>>>>>> 2) mprotect - change_protection in mprotect which does the deferred flush is
>>>>>> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
>>>>>> on those VMAs.
>>>> Sure, mprotect also changes vm_flags, so it really needs that anyway.
>>>>>> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
>>>>>> But SPF does not take UFFD faults.
>>>>>> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
>>>>>> (2) above.
>>>>>> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
>>>> What happened to shared/file-backed stuff? ISTR I had that working.
>>> 
>>> File-backed mappings are not processed in a speculative way, there were options to manage some of them depending on the underlying file system but that's still not done.
>>> 
>>> Shared anonymous mapping, are also not yet handled in a speculative way (vm_ops is not null).
>>> 
>>>>>> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
>>>>>> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
>>>> The tricky one is demotion, specifically write to non-write.
>>>>>> I could not see a case where speculative path cannot see a PTE update done via
>>>>>> a fault on another CPU.
>>>> One you didn't mention is the NUMA balancing scanning crud; although I
>>>> think that's fine, loosing a PTE update there is harmless. But I've not
>>>> thought overly hard on it.
>>> 
>>> That's a good point, I need to double check on that side.
>>> 
>>>>> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
>>>>> marking the VMA through vm_write_begin/end(), as for the fork case you
>>>>> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
>>>>> values read are valid.
>>>> That should indeed work, but are we really sure we covered them all?
>>>> Should we invest in better TLBI APIs to make sure we can't get this
>>>> wrong?
>>> 
>>> That may be a good option to identify deferred TLB invalidation but I've no clue on what this API would look like.
>> 
>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>> The basic idea is to save a generation in the page-struct that tracks
>> when deferred PTE change took place, and track whenever a TLB flush
>> completed. In addition, other users - such as mprotect - would use
>> the tlb_gather interface.
>> 
>> Unfortunately, due to limited space in page-struct this would only
>> be possible for 64-bit (and my implementation is only for x86-64).
> 
> I don't want to discourage you but I don't think this would end up
> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> and I've run into problems with this before while trying to do
> something similar.

Discourage, discourage. Better now than later.

It will be relatively easy to extend the scheme to be per-VMA instead of
per-table for architectures that prefer it this way. It does require
TLB-generation tracking though, which Andy only implemented for x86, so I
will focus on x86-64 right now.

[ For per-VMA it would require an additional cmpxchg, I presume to save the
last deferred generation though. ]

> I'd recommend per-vma and per-category (unmapping, clearing writable
> and clearing dirty) tracking, which only rely on arch-independent data
> structures, i.e., vm_area_struct and mm_struct.

I think that tracking changes on “what was changed” granularity is harder
and more fragile.

Let me finish trying the clean up the mess first, since fullmm and
need_flush_all semantics were mixed; there are 3 different flushing schemes
for mprotect(), munmap() and try_to_unmap(); there are missing memory
barriers; mprotect() performs TLB flushes even when permissions are
promoted; etc.

There are also some optimizations that we discussed before, such on x86 - 
RW->RO does not require a TLB flush if a PTE is not dirty, etc.

I am trying to finish something so you can say how terrible it is, so I will
not waste too much time. ;-)

>> It would still require to do the copying while holding the PTL though.
> 
> IMO, this is unacceptable. Most archs don't support per-table PTL, and
> even x86_64 can be configured to use per-mm PTL. What if we want to
> support a larger page size in the feature?
> 
> It seems to me the only way to solve the problem with self-explanatory
> code and without performance impact is to check mm_tlb_flush_pending
> and the writable bit (and two other cases I mentioned above) at the
> same time. Of course, this requires a lot of effort to audit the
> existing uses, clean them up and properly wrap them up with new
> primitives, BUG_ON all invalid cases and document the exact workflow
> to prevent misuses.
> 
> I've mentioned the following before -- it only demonstrates the rough
> idea.
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> 		goto unlock;
> 	}
> 	if (vmf->flags & FAULT_FLAG_WRITE) {
> -		if (!pte_write(entry))
> +		if (!pte_write(entry)) {
> +			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> +				flush_tlb_page(vmf->vma, vmf->address);
> 			return do_wp_page(vmf);
> +		}
> 		entry = pte_mkdirty(entry);
> 	}
> 	entry = pte_mkyoung(entry);

I understand. This might be required, regardless of the deferred flushes
detection scheme. If we assume that no write-unprotect requires a COW (which
should be true in this case, since we take a reference on the page), your
proposal should be sufficient.

Still, I think that there are many unnecessary TLB flushes right now,
and others that might be missed due to the overly complicated invalidation
schemes. 

Regardless, as Andrea pointed, this requires first to figure out the
semantics of mprotect() and friends when pages are pinned.