Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

Nadav Amit <nadav.amit@xxxxxxxxx> · Mon, 21 Dec 2020 14:55:12 -0800

> On Dec 21, 2020, at 2:30 PM, Peter Xu <peterx@xxxxxxxxxx> wrote:
> 
> On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote:
>> BTW: In general, I think that you are right, and that changing of PTEs
>> should not require taking mmap_lock for write. However, I am not sure
>> whether cow_user_page() is the only one that poses a problem or whether a
>> more systematic solution is needed. If cow_user_page() is the only problem, do
>> you think it is possible to do the copying while holding the PTL? It works
>> for normal-pages, but I am not sure whether special-pages pose special
>> problems.
>> 
>> Anyhow, this is an enhancement that we can try later.
> 
> AFAIU mprotect() is the only one who modifies the pte using the mmap write
> lock.  NUMA balancing is also using read mmap lock when changing pte
> protections, while my understanding is mprotect() used write lock only because
> it manipulates the address space itself (aka. vma layout) rather than modifying
> the ptes, so it needs to.

You are correct about NUMA balancing in general. Yet in practice it is not
an issue in our “use-case” since NUMA balancing preserves the write-bit.

> At the pte level, it seems always to be the pgtable lock that serializes things.
> 
> So it's perfectly legal to me for e.g. a driver to modify ptes with the read
> lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is
> taken when doing so.
> 
> If there's a driver that manipulated the ptes, changed the content of the page,
> restored the ptes to their original values, and all of this happened right
> after wp_page_copy() unlocked the pgtable lock but before wp_page_copy()
> retook the same lock, we may face the same issue and find that the copied
> page ends up containing corrupted data.  I don't know what to blame the
> driver for either, because it seems to be exactly following the rules.

For that to happen, the driver would have to change the PTEs without
flushing the TLB. Having said that, a driver could legitimately do so by
using inc_tlb_flush_pending() and batching the flushes.

> 
> I believe changing into write lock would solve the race here because tlb
> flushing would be guaranteed along the way, but I'm just a bit worried it's not
> the best way to go..

It might be too big of a hammer. But the question that comes to my mind is:
if it is ok to change the PTEs without mmap_lock held for write, why isn't
mmap_write_downgrade() called before mprotect_fixup() (so that mprotect
would not change the PTEs under the write-lock)? If you did so, you would
have the same problem as the one we encountered (a concurrent
protect-unprotect allows a concurrent COW #PF to copy the wrong data).
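
To make the problem concrete, here is a toy single-threaded userspace
model of the window as I understand it. It is not kernel code, and every
name in it (toy_mm, toy_remote_store(), toy_cow_fault(), ...) is made up
for illustration. It only shows that a store through a stale writable TLB
entry, landing between the copy and the PTE update, is lost unless the
flush precedes the copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only. Every name here is hypothetical; none is a kernel API. */
enum { TOY_PAGE_SIZE = 8 };

struct toy_mm {
	char old_page[TOY_PAGE_SIZE];	/* page mapped before the COW fault */
	char new_page[TOY_PAGE_SIZE];	/* COW destination */
	char *pte_page;			/* page the PTE currently points to */
	bool pte_write;			/* write bit in the PTE */
	char *tlb_page;			/* translation cached by a remote CPU */
	bool tlb_write;			/* write bit cached by that CPU */
	bool tlb_valid;			/* false once the entry is flushed */
};

static void toy_flush_tlb(struct toy_mm *mm)
{
	mm->tlb_valid = false;
}

/* Remote CPU stores one byte; returns false if it would take a #PF. */
static bool toy_remote_store(struct toy_mm *mm, char v)
{
	if (!mm->tlb_valid) {
		/* TLB miss: refill from the current PTE. */
		mm->tlb_page = mm->pte_page;
		mm->tlb_write = mm->pte_write;
		mm->tlb_valid = true;
	}
	if (!mm->tlb_write)
		return false;
	mm->tlb_page[0] = v;
	return true;
}

/* Returns the first byte visible through the PTE once everything settles. */
static char toy_cow_fault(bool flush_before_copy)
{
	struct toy_mm mm;
	bool stored;

	memset(mm.old_page, 'A', sizeof(mm.old_page));
	memset(mm.new_page, 0, sizeof(mm.new_page));
	mm.pte_page = mm.old_page;
	mm.pte_write = true;
	mm.tlb_page = mm.old_page;
	mm.tlb_write = true;
	mm.tlb_valid = true;

	/* 1. Write-protect the PTE; the TLB flush is deferred. */
	mm.pte_write = false;

	/* 2. COW fault: copy the old page, optionally flushing first. */
	if (flush_before_copy)
		toy_flush_tlb(&mm);
	memcpy(mm.new_page, mm.old_page, sizeof(mm.new_page));

	/* 3. Remote CPU stores 'B' between the copy and the PTE update. */
	stored = toy_remote_store(&mm, 'B');

	/* 4. Install the copy as writable and flush. */
	mm.pte_page = mm.new_page;
	mm.pte_write = true;
	toy_flush_tlb(&mm);

	/* 5. A store that faulted is retried through the new PTE. */
	if (!stored) {
		stored = toy_remote_store(&mm, 'B');
		assert(stored);
	}
	return mm.pte_page[0];
}
```

In this model toy_cow_fault(false) returns 'A' (the store went to the old
page and is lost), while toy_cow_fault(true) returns 'B'.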

So as an alternative solution, I can do the copying under the PTL after
flushing, which seems to solve the problem. Copying first (without a lock)
and then comparing seems suboptimal to me - I do not think the benefit
(if there is one) of shortening the time in which the lock is held is worth
the additional compare (and the complexity with special pages).

There are 2 problems in doing so:

1. I think that copy_user_highpage() and __copy_from_user_inatomic() can be
called while holding the PTL, but I am not sure.

2. For special pages we would need 2 TLB flushes: one to ensure the
write-bit is cleared, and a second one after we clear the PTE. We
can limit ourselves to soft-dirty/UFFD VMAs, but if we have your
hypothetical driver, this would not be good enough.
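
For reference, the two orderings discussed above (copy-then-compare versus
copy under the PTL after flushing) can be modeled roughly as follows. This
is again a toy userspace sketch under my own assumptions, not kernel code:
the names are hypothetical, the "PTL" is just a flag, and the placement of
the flush in the copy-then-compare variant is my guess at how it would have
to work. The counters are only there to show the extra copy and compare.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only. Every name here is hypothetical; none is a kernel API. */
enum { TOY_PAGE_SIZE = 16 };

struct toy_cow {
	char src[TOY_PAGE_SIZE];	/* old page */
	char dst[TOY_PAGE_SIZE];	/* new page */
	bool ptl_held;			/* stands in for the page-table lock */
	bool tlb_flushed;		/* no stale writable TLB entry remains */
	int copies;			/* full-page copies performed */
	int compares;			/* full-page compares performed */
};

static void toy_ptl_lock(struct toy_cow *c)
{
	assert(!c->ptl_held);
	c->ptl_held = true;
}

static void toy_ptl_unlock(struct toy_cow *c)
{
	assert(c->ptl_held);
	c->ptl_held = false;
}

/* A store through a stale TLB entry; impossible once the TLB is flushed. */
static void toy_stale_store(struct toy_cow *c)
{
	if (!c->tlb_flushed)
		c->src[0] ^= 1;
}

/* Variant A: copy without the lock, then lock, flush, compare, maybe retry. */
static void toy_copy_then_compare(struct toy_cow *c)
{
	for (;;) {
		memcpy(c->dst, c->src, TOY_PAGE_SIZE);
		c->copies++;
		toy_stale_store(c);	/* racing store in the unlocked window */

		toy_ptl_lock(c);
		c->tlb_flushed = true;
		c->compares++;
		if (memcmp(c->dst, c->src, TOY_PAGE_SIZE) == 0)
			break;		/* copy is consistent; PTL still held */
		toy_ptl_unlock(c);
	}
	/* The new PTE would be installed here. */
	toy_ptl_unlock(c);
}

/* Variant B: lock, flush, then copy under the lock. */
static void toy_copy_under_ptl(struct toy_cow *c)
{
	toy_ptl_lock(c);
	c->tlb_flushed = true;
	toy_stale_store(c);		/* no effect: no stale entry is left */
	memcpy(c->dst, c->src, TOY_PAGE_SIZE);
	c->copies++;
	/* The new PTE would be installed here. */
	toy_ptl_unlock(c);
}
```

Starting from a zero-initialized struct toy_cow, variant A ends up doing two
copies and two compares when a store races with the first copy, whereas
variant B does a single copy and no compare, at the price of holding the
lock across the copy.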