> On Dec 22, 2020, at 12:34 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> On Sat, Dec 19, 2020 at 2:06 PM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>> [ I have in mind another solution, such as keeping in each page-table a
>>> “table-generation” which is the mm-generation at the time of the change,
>>> and only flushing if “table-generation” == “mm-generation”, but it requires
>>> some thought on how to avoid adding new memory barriers. ]
>>>
>>> IOW: I think the change that you suggest is insufficient, and a proper
>>> solution is too intrusive for “stable”.
>>>
>>> As for performance, I can add another patch later to remove the TLB flush
>>> that is unnecessarily performed during change_protection_range() that does
>>> permission promotion. I know that your concern is about the “protect” case,
>>> but I cannot think of a good immediate solution that avoids taking mmap_lock
>>> for write.
>>>
>>> Thoughts?
>>
>> On second thought (i.e., I don’t know what I was thinking), doing so -
>> checking mm_tlb_flush_pending() on every potentially dangerous PTE read
>> and flushing if needed - can lead to a huge number of TLB flushes and
>> shootdowns, as the counter might stay elevated for a considerable amount
>> of time.
>
> I've lost track as to whether we still think that this particular
> problem is really a problem,

If by “problem” you mean whether there is a correctness issue with the
userfaultfd and soft-dirty deferred flushes under mmap_read_lock() - yes,
there is a problem, and I reproduced these failures on upstream.

If by “problem” you mean performance - I do not know.

> but could we perhaps make the
> tlb_flush_pending field be per-ptl instead of per-mm?  Depending on
> how it gets used, it could plausibly be done without atomics or
> expensive barriers by using PTL to protect the field.
>
> FWIW, x86 has a mm generation counter, and I don't think it would be
> totally crazy to find a way to expose an mm generation to core code.
> I don't think we'd want to expose the specific data structures that
> x86 uses to track it -- they're very tailored to the oddities of x86
> TLB management.  x86 also doesn't currently have any global concept of
> which mm generation is guaranteed to have been propagated to all CPUs
> -- we track the generation in the pagetables and, per cpu, the
> generation that we know that CPU has seen.  x86 could offer a function
> "ensure that all CPUs catch up to mm generation G and don't return
> until this happens" and its relative "have all CPUs caught up to mm
> generation G", but these would need to look at data from multiple CPUs
> and would probably be too expensive on very large systems to use in
> normal page faults unless we were to cache the results somewhere.
> Making a nice cache for this is surely doable, but maybe more
> complexity than we'd want.

I had somewhat similar ideas - saving the generation in each page-table's
page-struct - which would allow us to: (1) extend pte_same() to detect
interim changes that were reverted (RO->RW->RO), and (2) track pending
flushes per PTE.

Obviously, I cannot do that as part of this fix. Moreover, it seems to me
that it would require a memory barrier after updating the PTEs and before
reading the current generation (which would then be saved per page-table).
I have tried to think of schemes that would use the per-CPU generation
instead, but I could not come up with one and have not had the time to
figure it out.
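
To make this a bit more concrete, here is a very rough, untested sketch of
the generation idea; mm_gen, flushed_gen and table_gen are made-up fields
(imagined in mm_struct and in the page-table's struct page), not anything
that exists today:

#include <linux/mm.h>
#include <asm/tlbflush.h>

/* Writer side, e.g. change_protection_range() deferring its flush: */
static void record_pending_flush(struct mm_struct *mm, struct page *pt_page)
{
        /*
         * The PTE updates done just before this must be ordered before we
         * sample the current generation - this is the extra barrier that
         * worries me.
         */
        smp_mb();
        WRITE_ONCE(pt_page->table_gen, (u64)atomic64_read(&mm->mm_gen));
}

/* Reader side, e.g. a page-fault path about to rely on a PTE it read: */
static void flush_if_table_pending(struct vm_area_struct *vma,
                                   struct page *pt_page, unsigned long addr)
{
        struct mm_struct *mm = vma->vm_mm;

        /*
         * Flush only if this page-table recorded changes newer than what
         * has already been flushed; flushed_gen would be advanced by
         * whoever eventually performs the deferred flush.
         */
        if (READ_ONCE(pt_page->table_gen) >
            (u64)atomic64_read(&mm->flushed_gen))
                flush_tlb_page(vma, addr);
}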
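
For comparison, a minimal sketch of the per-ptl tlb_flush_pending variant you
suggest might look as follows; pt_flush_pending is again a made-up counter,
imagined to live in the page-table's struct page next to the split PTL, so
that the PTL itself serializes all accesses and no atomics or extra barriers
are needed:

#include <linux/spinlock.h>
#include <linux/mm_types.h>

/* Before deferring a flush for PTEs covered by this page-table: */
static void pt_inc_flush_pending(struct page *pt_page, spinlock_t *ptl)
{
        spin_lock(ptl);
        pt_page->pt_flush_pending++;
        spin_unlock(ptl);
}

/* After the deferred TLB flush has actually been performed: */
static void pt_dec_flush_pending(struct page *pt_page, spinlock_t *ptl)
{
        spin_lock(ptl);
        pt_page->pt_flush_pending--;
        spin_unlock(ptl);
}

/*
 * The interesting readers (e.g. a fault handler deciding whether it must
 * flush before relying on a PTE) already hold the PTL, so a plain load is
 * enough.
 */
static bool pt_has_flush_pending(struct page *pt_page, spinlock_t *ptl)
{
        lockdep_assert_held(ptl);
        return pt_page->pt_flush_pending != 0;
}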
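
And for the “catch up to mm generation G” interface you describe: the per-CPU
bookkeeping below is invented for the example (the real x86 tracking lives in
cpu_tlbstate and is per mm-context), but it shows why such a function has to
look at every CPU and why it would be expensive on very large systems:

#include <linux/percpu.h>
#include <linux/cpumask.h>

/* Invented per-CPU "last generation this CPU has seen", for illustration. */
static DEFINE_PER_CPU(u64, cpu_seen_gen);

/* "Have all CPUs caught up to mm generation gen?"
 * (mm is unused in this oversimplified, not-per-mm version.) */
static bool mm_gen_caught_up(struct mm_struct *mm, u64 gen)
{
        int cpu;

        for_each_online_cpu(cpu) {
                if (per_cpu(cpu_seen_gen, cpu) < gen)
                        return false;
        }
        return true;
}

/* "Do not return until all CPUs have caught up to mm generation gen." */
static void mm_gen_wait(struct mm_struct *mm, u64 gen)
{
        while (!mm_gen_caught_up(mm, gen))
                cpu_relax();
}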