On Wed, Dec 23, 2020 at 04:39:00PM -0500, Andrea Arcangeli wrote: > On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote: > > Thanks for the details. > > I hope we can find a way put the page_mapcount back where there's a > page_count right now. > > If you're so worried about having to maintain a all defined well > documented (or to be documented even better if you ACK it) > marker/catcher for userfaultfd_writeprotect, I can't see how you could > consider to maintain the page fault safe against any random code > leaving too permissive TLB entries out of sync of the more restrictive > pte permissions as it was happening with clear_refs_write, which > worked by luck until page_mapcount was changed to page_count. > > page_count is far from optimal, but it is a feature it finally allowed > us to notice that various code (clear_refs_write included apparently > even after the fix) leaves stale too permissive TLB entries when it > shouldn't. > > The question is only which way you prefer to fix clear_refs_write and > I don't think we can deviate from those 3 methods that already exist > today. So clear_refs_write will have to pick one of those and > currently it's not falling in the same category with mprotect even > after the fix. > > I think if clear_refs_write starts to take the mmap_write_lock and > really start to operate like mprotect, only then we can consider to > make userfaultfd_writeprotect also operate like mprotect. > > Even then I'd hope we can at least be allowed to make it operate like > KSM write_protect_page for len <= HPAGE_PMD_SIZE, something that > clear_refs_write cannot do since it works in O(N) and tends to scan > everything at once, so there would be no point to optimize not to > defer the flush, for a process with a tiny amount of virtual memory > mapped. > > vm86 also should be fixed to fall in the same category with mprotect, > since performance there is irrelevant. I was hesitant to suggest the following because it isn't that straight forward. But since you seem to be less concerned with the complexity, I'll just bring it on the table -- it would take care of both ufd and clear_refs_write, wouldn't it? diff --git a/mm/memory.c b/mm/memory.c index 5e9ca612d7d7..af38c5ee327e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) goto unlock; } if (vmf->flags & FAULT_FLAG_WRITE) { - if (!pte_write(entry)) + if (!pte_write(entry)) { + if (mm_tlb_flush_pending(vmf->vma->vm_mm)) + flush_tlb_page(vmf->vma, vmf->address); return do_wp_page(vmf); + } entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry);