On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote: > > On Dec 21, 2020, at 1:24 PM, Yu Zhao <yuzhao@xxxxxxxxxx> wrote: > > > > On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote: > >> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@xxxxxxxxx> wrote: > >>> Using mmap_write_lock() was my initial fix and there was a strong pushback > >>> on this approach due to its potential impact on performance. > >> > >> From whom? > >> > >> Somebody who doesn't understand that correctness is more important > >> than performance? And that userfaultfd is not the most important part > >> of the system? > >> > >> The fact is, userfaultfd is CLEARLY BUGGY. > >> > >> Linus > > > > Fair enough. > > > > Nadav, for your patch (you might want to update the commit message). > > > > Reviewed-by: Yu Zhao <yuzhao@xxxxxxxxxx> > > > > While we are all here, there is also clear_soft_dirty() that could > > use a similar fix… > > Just an update as for why I have still not sent v2: I fixed > clear_soft_dirty(), created a reproducer, and the reproducer kept failing. > > So after some debugging, it appears that clear_refs_write() does not flush > the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494 > ("asm-generic/tlb: avoid potential double flush”), tlb_finish_mmu() does not > flush the TLB since there is clear_refs_write() does not call to > __tlb_adjust_range() (unless there are nested TLBs are pending). > > So I have a patch for this issue too: arguably the tlb_gather interface is > not the right one for clear_refs_write() that does not clear PTEs but > changes them. > > Yet, sadly, my reproducer keeps falling (less frequently, but still). So I > will keep debugging to see what goes wrong. I will send v2 once I figure out > what the heck is wrong in the code or my reproducer. If you put the page_mapcount check back in do_wp_page instead of page_count, it'll stop reproducing but the bug is still very much there... It's a feature page_count finally shows you the corruption now by virtue of the page_count being totally unreliable with all speculative pagecache lookups randomly elevating it in the background. The proof it worked by luck is that an unrelated change (s/page_mapcount/page_count/) made the page fault behave slightly different and broke clear_refs_write. Even before page_mapcount was replaced with page_count, it has always been forbidden to leave too permissive stale TLB entries out of sync with a more restrictive pte/hugepmd permission past the PT unlock, unless you're holding the mmap_write_lock. So for example all rmap code has to flush before PT unlock release too, usually it clears the pte as a whole but it's still a downgrade. The rmap_lock and the mmap_read_lock achieve the same: they keep the vma stable but they can't stop the page fault from running (that's a feature) so they have to flush inside the PT lock. The tlb gather deals with preventing use after free (where userland can modify kernel memory), but it cannot deal with the guarantee the page fault requires. So the clear_refs_write patch linked that alters the tlb flushing appears a noop with respect to this bug. It cannot do anything to prevent the page fault run with writable stale TLB entries with the !pte_write. If you don't add a marker here (it clears it, the exact opposite of what should be happening), there's no way to avoid the mmap_write_lock in my view. static inline void clear_soft_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *pte) { /* * The soft-dirty tracker uses #PF-s to catch writes * to pages, so write-protect the pte as well. See the * Documentation/admin-guide/mm/soft-dirty.rst for full description * of how soft-dirty works. */ pte_t ptent = *pte; if (pte_present(ptent)) { pte_t old_pte; old_pte = ptep_modify_prot_start(vma, addr, pte); ptent = pte_wrprotect(old_pte); ptent = pte_clear_soft_dirty(ptent); + ptent = pte_mkuffd_wp(ptent); ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent); One solution that would fix the userland mm corruption in clear_refs_write is to take the mmap_read_lock, take some mutex somewhere (vma/mm whatever), then in clear_soft_dirty make the above modification adding the _PAGE_UFFD_WP, then flush tlb, release the mutex and then release the mmap_read_lock. Then here: if (userfaultfd_pte_wp(vma, *vmf->pte)) { pte_unmap_unlock(vmf->pte, vmf->ptl); + if (vma->vm_flags & VM_SOFTDIRTY) + return handle_soft_dirty(vma); return handle_userfault(vmf, VM_UFFD_WP); Of course handle_soft_dirty will have to take the mutex once (1 mutex_lock/unlock cycle to run after any pending flush). And then we'll have to enforce uffd-wp cannot be registered if VM_SOFTDIRTY is set or the other way around so that VM_UFFD* is mutually exclusive with VM_SOFTDIRTY. So then we can also unify the bit so they all use the same software bit in the pgtable (that's something I considered anyway originally since it doesn't make whole lot of sense to use the two features on the same vma at the same time). If the above is too complex, clear_refs_write will have to grind down to I/O disk spindle speed as mprotect with s/mmap_read_lock/mmap_write_lock/, unless it stops triggering wrprotect faults altogether. Thanks, Andrea