Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

Nadav Amit <nadav.amit@xxxxxxxxx> · Wed, 23 Dec 2020 14:45:59 -0800

> On Dec 23, 2020, at 2:05 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> 
> On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
>>> On Dec 21, 2020, at 1:24 PM, Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>>> 
>>> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
>>>> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>>>> Using mmap_write_lock() was my initial fix and there was a strong pushback
>>>>> on this approach due to its potential impact on performance.
>>>> 
>>>> From whom?
>>>> 
>>>> Somebody who doesn't understand that correctness is more important
>>>> than performance? And that userfaultfd is not the most important part
>>>> of the system?
>>>> 
>>>> The fact is, userfaultfd is CLEARLY BUGGY.
>>>> 
>>>>         Linus
>>> 
>>> Fair enough.
>>> 
>>> Nadav, for your patch (you might want to update the commit message).
>>> 
>>> Reviewed-by: Yu Zhao <yuzhao@xxxxxxxxxx>
>>> 
>>> While we are all here, there is also clear_soft_dirty() that could
>>> use a similar fix…
>> 
>> Just an update as for why I have still not sent v2: I fixed
>> clear_soft_dirty(), created a reproducer, and the reproducer kept failing.
>> 
>> So after some debugging, it appears that clear_refs_write() does not flush
>> the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494
>> ("asm-generic/tlb: avoid potential double flush”), tlb_finish_mmu() does not
>> flush the TLB since there is clear_refs_write() does not call to
>> __tlb_adjust_range() (unless there are nested TLBs are pending).
>> 
>> So I have a patch for this issue too: arguably the tlb_gather interface is
>> not the right one for clear_refs_write() that does not clear PTEs but
>> changes them.
>> 
>> Yet, sadly, my reproducer keeps falling (less frequently, but still). So I
>> will keep debugging to see what goes wrong. I will send v2 once I figure out
>> what the heck is wrong in the code or my reproducer.
> 
> If you put the page_mapcount check back in do_wp_page instead of
> page_count, it'll stop reproducing but the bug is still very much
> there...

I know. I designed the (re)producer just to be able to hit the bug without
changing the kernel (and test my fix), but I am fully aware that it would
not work on older kernels although the bug is still there.

> And then we'll have to enforce uffd-wp cannot be registered if
> VM_SOFTDIRTY is set or the other way around so that VM_UFFD* is
> mutually exclusive with VM_SOFTDIRTY. So then we can also unify the
> bit so they all use the same software bit in the pgtable (that's
> something I considered anyway originally since it doesn't make whole
> lot of sense to use the two features on the same vma at the same
> time).

I think it may be reasonable.

Just a proposal: At some point we can also ask ourselves whether the
“artificial" limitation of the number of software bits per PTE should really
limit us, or do we want to hold some additional metadata per-PTE by either
putting it in an adjacent page (holding 64-bits of additional software-bits
per PTE) or by finding some place in the page-struct to link to this
metadata (and have the liberty of number of bits per PTE). One of the PTE
software-bits can be repurposed to say whether there is “extra-metadata”
associated with the PTE.

I am fully aware that there will be some overhead associated, but it
can be limited to less-common use-cases.