On Tue, Dec 22, 2020 at 12:58:18PM -0800, Nadav Amit wrote:
> I had somewhat similar ideas - saving in each page-struct the generation,
> which would allow to: (1) extend pte_same() to detect interim changes
> that were reverted (RO->RW->RO) and (2) per-PTE pending flushes.

What don't you feel safe about? What's the problem with RO->RO->RO? I
don't get it.

The pte_same is perfectly ok without a sequence counter in my view. I've
never seen anything that would not be ok with pte_same given all the
invariants are respected. It's actually a great optimization compared
to any unscalable sequence counter.

The counter would slow down everything: having to increase a counter
every time you change a pte, no matter if it's a counter per pgtable or
per-vma or per-mm, sounds very bad.

I'd rather take mmap_lock_write across the whole userfaultfd ioctl than
have to deal with a new sequence counter increase for every pte
modification on a heavily contended cacheline.

Also note the counter would have solved nothing for
userfaultfd_writeprotect: it's useless for detecting stale TLB entries.
See how !pte_write check happens after the counter was already increased:

CPU0                    CPU 1           CPU 2
------                  --------        -------
userfaultfd_wrprotect(mode_wp = true)
PT lock
atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
false_shared_counter_counter++
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PT unlock

                        do_page_fault FAULT_FLAG_WRITE
                                        userfaultfd_wrprotect(mode_wp = false)
                                        PT lock
                                        ATOMIC clear _PAGE_UFFD_WP <- problem
                                        /* _PAGE_WRITE not set */
                                        false_shared_counter_counter++
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                        PT unlock
                                        XXXXXXXXXXXXXX BUG RACE window open here

                        PT lock
                        counter = false_shared_counter_counter
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                        FAULT_FLAG_WRITE is set by CPU
                        _PAGE_WRITE is still clear in pte
                        PT unlock

                        wp_page_copy
                        copy_user_page runs with stale TLB

                        pte_same(counter, orig_pte, pte) -> PASS
                                 ^^^^^^^                    ^^^^
                        commit the copy to the pte with the lost writes

deferred tlb flush <- too late
                                        XXXXXXXXXXXXXX BUG RACE window close here
================================================================================
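
To make the ordering above concrete, here is a minimal userspace sketch
that replays the same interleaving step by step in a single thread. It is
a toy model and not kernel code: struct toy_pte, struct toy_state,
toy_pte_same() and replay_race() are all hypothetical names invented for
this illustration, and the "stale TLB" is just a boolean. The only point
is that a counter sampled after both increments still matches at recheck
time, so it says nothing about the TLB entry that has not been flushed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; every name below is hypothetical, none is a kernel API. */
struct toy_pte {
	bool write;	/* models _PAGE_WRITE */
	bool uffd_wp;	/* models _PAGE_UFFD_WP */
};

struct toy_state {
	struct toy_pte pte;
	unsigned long counter;		/* the proposed sequence counter */
	bool stale_tlb_writable;	/* a CPU still caches the old writable pte */
};

static bool toy_pte_same(struct toy_pte a, struct toy_pte b)
{
	return a.write == b.write && a.uffd_wp == b.uffd_wp;
}

/*
 * Replays the interleaving from the diagram in program order.
 * Returns true if the fault path commits its copy although the page was
 * copied while a stale writable TLB entry still existed.
 */
static bool replay_race(void)
{
	struct toy_state s = {
		.pte = { .write = true, .uffd_wp = false },
		.counter = 0,
		.stale_tlb_writable = true,
	};

	/* CPU 0: wrprotect(mode_wp = true), TLB flush is deferred */
	s.pte.write = false;
	s.pte.uffd_wp = true;
	s.counter++;

	/* CPU 2: wrprotect(mode_wp = false), _PAGE_WRITE stays clear */
	s.pte.uffd_wp = false;
	s.counter++;
	assert(s.counter == 2);

	/* CPU 1: write fault samples counter and pte under the PT lock */
	unsigned long seen_counter = s.counter;
	struct toy_pte orig_pte = s.pte;

	/* CPU 1: the copy runs while the stale writable entry is still live */
	bool copied_with_stale_tlb = s.stale_tlb_writable;

	/* CPU 1: recheck; nothing changed since the sample, so it passes */
	bool recheck_passes = seen_counter == s.counter &&
			      toy_pte_same(orig_pte, s.pte);

	/* CPU 0: the deferred flush finally happens, too late */
	s.stale_tlb_writable = false;

	return recheck_passes && copied_with_stale_tlb;
}
```

In this model replay_race() returns true: both increments land before
the fault path samples the counter, so the recheck passes no matter how
the counter is scoped (per pgtable, per-vma or per-mm).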