On Thu, Sep 17, 2020 at 11:11:06AM -0700, Linus Torvalds wrote: > (a) if the pinner is going to change the page, it will have to get > the pin with FOLL_WRITE in addition to FOLL_PIN > > (b) that means it will do the COW and mark the page writable and dirty Yep > (c) the whole _point_ of the FOLL_PIN is that subsequent operations > shouldn't make it non-writable any more (ie it can't be unmapped, and > we should synchronize on fork etc) It is the ideal, but FOLL_PIN has been troubled for a long time: https://lwn.net/Articles/753027/ ie writeprotect is known to happen due to writeback. I had understood that fork() will also cause write protect. Eg copy_process() copy_mm() dup_mm() dup_mmap() copy_page_range() [..] copy_one_pte() Gets to: if (is_cow_mapping(vm_flags) && pte_write(pte)) { ptep_set_wrprotect(src_mm, addr, src_pte); pte = pte_wrprotect(pte); } Which blindly write protects a FOLL_PIN page. Here src_pte will be under a pin. I *think* the failing test is basically: 1) pin_user_pages(mem, FOLL_FORCE | FOLL_WRITE) 2) pid = fork() 3) child: does a bit, then exec 4) parent: waitpid(pid) 5) *mem = 0 Here #5 is the WP triggered COW. We know a WP triggered COW is happening from the bisect and success result with MADV_DONTFORK. #2 triggers the dup_mmap() and the ptep_set_wrprotect() (From inspection, at least) This "Trial do_wp_page() simplification" patch means that when #5 goes to do COW it gets a copy instead of a re-use because the reuse_swap_page() was aborting the copy before. So, to your point, yes ideally FOLL_PIN would never write-protect pages! Looking for awhile, this now looks reasonable and doable. page_maybe_dma_pinned() was created for exactly this kind of case. I've attached a dumb sketch for the pte level (surely wrong! I have never looked at this part of the mm before!) at the end of this message. Peter, do you know this better? Does this inspire you to make a patch? :) Would really love to see this. We have such a huge expensive mess with MADV_DONTFORK, this would eliminate all that overhead. Thanks, Jason diff --git a/mm/memory.c b/mm/memory.c index 469af373ae76e1..6bc19a43da1391 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -689,6 +689,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, } #endif +static int duplicate_page(pte_t *newpte, struct vm_area_struct *vma, + unsigned long address, struct page *page) +{ + struct page *new_page; + + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); + if (!new_page) + return -ENOMEM; + copy_user_highpage(new_page, page, address, vma); + + /* FIXME: surely more than this */ + *newpte = mk_pte(new_page, vma->vm_page_prot); + return 0; +} + /* * copy one vm_area from one task to the other. Assumes the page tables * already present in the new task to be cleared in the whole range @@ -703,6 +718,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, unsigned long vm_flags = vma->vm_flags; pte_t pte = *src_pte; struct page *page; + bool do_src_wp; /* pte contains position in swap or file, so copy. */ if (unlikely(!pte_present(pte))) { @@ -775,15 +791,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, goto out_set_pte; } - /* - * If it's a COW mapping, write protect it both - * in the parent and the child - */ - if (is_cow_mapping(vm_flags) && pte_write(pte)) { - ptep_set_wrprotect(src_mm, addr, src_pte); - pte = pte_wrprotect(pte); - } - /* * If it's a shared mapping, mark it clean in * the child @@ -800,11 +807,34 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, if (!(vm_flags & VM_UFFD_WP)) pte = pte_clear_uffd_wp(pte); + do_src_wp = is_cow_mapping(vm_flags) && pte_write(pte); page = vm_normal_page(vma, addr, pte); if (page) { get_page(page); page_dup_rmap(page, false); rss[mm_counter(page)]++; + + /* + * If a page is DMA pinned we never want to write protect it, + * copy it now. + */ + if (do_src_wp && page_maybe_dma_pinned(page)) { + do_src_wp = false; + ret = duplicate_page(&pte, vma, addr, page); + if (ret) + /* FIXME: need to restructure a bit to handle this */ + return ret; + } + } + + /* + * If it's a COW mapping, write protect it both + * in the parent and the child + * FIXME check carefully this is new order is OK + */ + if (do_src_wp) { + ptep_set_wrprotect(src_mm, addr, src_pte); + pte = pte_wrprotect(pte); } out_set_pte: