Starting from commit 17839856fd58 ("gup: document and work around "COW can break either way" issue", 2020-06-02), explicit copy-on-write behavior is enforced for private gup pages even if it's a read-only. It is achieved by always passing FOLL_WRITE to emulate a write. That should fix the COW issue that we were facing, however above commit could also break userfaultfd-wp and applications like umapsort [1,2]. One general routine of umap-like program is: userspace library will manage page allocations, and it will evict the least recently used pages from memory to external storages (e.g., file systems). Below are the general steps to evict an in-memory page in the uffd service thread when the page pool is full: (1) UFFDIO_WRITEPROTECT with mode=WP on some to-be-evicted page P, so that further writes to page P will block (keep page P clean) (2) Copy page P to external storage (e.g. file system) (3) MADV_DONTNEED to evict page P Here step (1) makes sure that the page to dump will always be up-to-date, so that the page snapshot in the file system is consistent with the one that was in the memory. However with commit 17839856fd58, step (2) can potentially hang itself because e.g. if we use write() to a file system fd to dump the page data, that will be a translated read gup request in the file system driver to read the page content, then the read gup will be translated to a write gup due to the new enforced COW behavior. This write gup will further trigger handle_userfault() and hang the uffd service thread itself. I think the problem will go away too if we replace the write() to the file system into a memory write to a mmaped region in the userspace library, because normal page faults will not enforce COW, only gup is affected. However we cannot forbid users to use write() or any form of kernel level read gup. One solution is actually already mentioned in commit 17839856fd58, which is to provide an explicit BREAK_COW scemantics for enforced COW. Then we can still use FAULT_FLAG_WRITE to identify whether this is a "real write request" or an "enfornced COW (read) request". [1] https://github.com/LLNL/umap-apps/blob/develop/src/umapsort/umapsort.cpp [2] https://github.com/LLNL/umap CC: Marty Mcfadden <mcfadden8@xxxxxxxx> CC: Andrea Arcangeli <aarcange@xxxxxxxxxx> CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> CC: Jann Horn <jannh@xxxxxxxxxx> CC: Christoph Hellwig <hch@xxxxxx> CC: Oleg Nesterov <oleg@xxxxxxxxxx> CC: Kirill Shutemov <kirill@xxxxxxxxxxxxx> CC: Jan Kara <jack@xxxxxxx> Fixes: 17839856fd588f4ab6b789f482ed3ffd7c403e1f Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> --- include/linux/mm.h | 3 +++ mm/gup.c | 4 ++-- mm/memory.c | 15 ++++++++++++--- 3 files changed, 17 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index f6a82f9bccd7..dacba5c7942f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -409,6 +409,7 @@ extern pgprot_t protection_map[16]; * @FAULT_FLAG_REMOTE: The fault is not for current task/mm. * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch. * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals. + * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read) * * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify * whether we would allow page faults to retry by specifying these two @@ -439,6 +440,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_REMOTE 0x80 #define FAULT_FLAG_INSTRUCTION 0x100 #define FAULT_FLAG_INTERRUPTIBLE 0x200 +#define FAULT_FLAG_BREAK_COW 0x400 /* * The default fault flags that should be used by most of the @@ -2756,6 +2758,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */ #define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */ #define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */ +#define FOLL_BREAK_COW 0x100000 /* request for explicit COW (even for read) */ /* * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each diff --git a/mm/gup.c b/mm/gup.c index d8a33dd1430d..02267f5797a7 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1076,7 +1076,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, } if (is_vm_hugetlb_page(vma)) { if (should_force_cow_break(vma, foll_flags)) - foll_flags |= FOLL_WRITE; + foll_flags |= FOLL_BREAK_COW; i = follow_hugetlb_page(mm, vma, pages, vmas, &start, &nr_pages, i, foll_flags, locked); @@ -1095,7 +1095,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, } if (should_force_cow_break(vma, foll_flags)) - foll_flags |= FOLL_WRITE; + foll_flags |= FOLL_BREAK_COW; retry: /* diff --git a/mm/memory.c b/mm/memory.c index c39a13b09602..0c819056374e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2900,7 +2900,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; - if (userfaultfd_pte_wp(vma, *vmf->pte)) { + if ((vmf->flags & FAULT_FLAG_WRITE) && + userfaultfd_pte_wp(vma, *vmf->pte)) { pte_unmap_unlock(vmf->pte, vmf->ptl); return handle_userfault(vmf, VM_UFFD_WP); } @@ -3290,7 +3291,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) put_page(swapcache); } - if (vmf->flags & FAULT_FLAG_WRITE) { + /* + * We'll do a COW if it's a write or the caller wants explicit COW + * behavior (even if it's a read operation) + */ + if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) { ret |= do_wp_page(vmf); if (ret & VM_FAULT_ERROR) ret &= VM_FAULT_ERROR; @@ -4241,7 +4246,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); goto unlock; } - if (vmf->flags & FAULT_FLAG_WRITE) { + /* + * We'll do a COW if it's a write or the caller wants explicit COW + * behavior (even if it's a read operation) + */ + if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) { if (!pte_write(entry)) return do_wp_page(vmf); entry = pte_mkdirty(entry); -- 2.26.2