This reverts commit 09854ba94c6aad7886996bfbee2530b3d8a7f4f4. This reverts commit be068f29034fb00530a053d18b8cf140c32b12b3. This reverts commit 1a0cf26323c80e2f1c58fc04f15686de61bfab0c. This example below uses O_DIRECT, on a process under clear_refs. There's no FOLL_LONGTERM involvement. All GUP pins are transient. fork is never called and even when clear_refs is cleared by an external program, fork() would not be involved. thread0 thread1 other process (or thread 3) ============= ============= =========== read syscall O_DIRECT read DMA to vaddr+0 len = 512 bytes GUP(FOLL_WRITE) DMA writing to RAM started clear_refs pte_wrprotect write vaddrA+512 page_count == 2 wp_copy_page read syscall returns read lost Notwithstanding the fact that failing O_DIRECT at sub-PAGE_SIZE granularity is an ABI break by itself, this is of course not just about O_DIRECT. recvmsg kernel TLS and plenty of other GUP FOLL_WRITE iov_iter_get_pages users would write to the memory with sub-PAGE_SIZE granularity, depending on the msghdr iov tiny buffers. Those recvmsg are already silently lost too as well as the above O_DIRECT already get lost. The fact COW must not happen too much is documented in commit 6d0a07edd17cfc12fdc1f36de8072fa17cc3666f: == This will provide fully accuracy to the mapcount calculation in the write protect faults, so page pinning will not get broken by false positive copy-on-writes. == And in the current comment above the THP mapcount: == * [..] we only use * page_trans_huge_mapcount() in the copy-on-write faults where we * need full accuracy to avoid breaking page pinning, [..] == Quote from the Link tagged: == In other words, the page_count check in do_wp_page, extends the old fork vs gup vs threads race, so that it also happens within a single thread without fork() involvement. === In addition FOLL_LONGTERM breaks for readonly DMA, despite FOLL_LONGTERM pages are now treated specially and copied in fork(). FOLL_LONGTERM is in no way special with regard to the VM core and it can't be. From the VM standpoint all GUP in are equal and breaking one, means breaking them all the same. It is not possible to resolve this by looking only at FOLL_LONGTERM, that only hides the problem, but it doesn't fully resolve it as shown above. In addition this avoids storms of extra false positive COWs if the THP are shared by different processes, which causes extra overhead, but the several performance considerations are only a secondary concern. After this commit, the copy in fork() for FOLL_LONGTERM also must be extended to all GUP pins including transient pins or we shouldn't even copy FOLL_LONGTERM pinned pages. Treating FOLL_LONGTERM any different from transient GUP pin is sign that something is fundamentally flawed. FOLL_LONGTERM can be treated different in writeback and during GUP runtime itself, but not in the VM-core when deciding when to COW or not to COW an anonymous page. This revert causes no theoretical security regression because THP is not yet using page_count in do_huge_pmd_wp_page. The alternative of this patch would be to replace the mapcount with the page_count in do_huge_pmd_wp_page too in order to really close the vmsplice attack from child to parent that way. However since a single transient GUP pin on a tail page, would elevate the page_count for all other tail pages (unlike the mapcount which is subpage granular), the above "fork-less" race would span over a HPAGE_PMD_SIZE range, it would even cross different vmas and the effect would happen at a distance in vma of different processes allowing a child to corrupt memory in the parent. That's a problem that could happen not-maliciously too. So the scenario described above, if extended to THP new refcounting, would be more concerning than the vmsplice issue, that can be tackled first in vmsplice itself, and only at a second stage in the COW code. Link: https://lkml.kernel.org/r/20210107200402.31095-1-aarcange@xxxxxxxxxx Cc: stable@xxxxxxxxxx Fixes: 09854ba94c6a ("mm: do_wp_page() simplification") Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx> --- include/linux/ksm.h | 7 ++++++ mm/ksm.c | 25 +++++++++++++++++++ mm/memory.c | 59 ++++++++++++++++++++++++++++++++------------- 3 files changed, 74 insertions(+), 17 deletions(-) diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 161e8164abcf..e48b1e453ff5 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -53,6 +53,8 @@ struct page *ksm_might_need_to_copy(struct page *page, void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc); void ksm_migrate_page(struct page *newpage, struct page *oldpage); +bool reuse_ksm_page(struct page *page, + struct vm_area_struct *vma, unsigned long address); #else /* !CONFIG_KSM */ @@ -86,6 +88,11 @@ static inline void rmap_walk_ksm(struct page *page, static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage) { } +static inline bool reuse_ksm_page(struct page *page, + struct vm_area_struct *vma, unsigned long address) +{ + return false; +} #endif /* CONFIG_MMU */ #endif /* !CONFIG_KSM */ diff --git a/mm/ksm.c b/mm/ksm.c index 9694ee2c71de..9503e11ab5b4 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2665,6 +2665,31 @@ void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc) goto again; } +bool reuse_ksm_page(struct page *page, + struct vm_area_struct *vma, + unsigned long address) +{ +#ifdef CONFIG_DEBUG_VM + if (WARN_ON(is_zero_pfn(page_to_pfn(page))) || + WARN_ON(!page_mapped(page)) || + WARN_ON(!PageLocked(page))) { + dump_page(page, "reuse_ksm_page"); + return false; + } +#endif + + if (PageSwapCache(page) || !page_stable_node(page)) + return false; + /* Prohibit parallel get_ksm_page() */ + if (!page_ref_freeze(page, 1)) + return false; + + page_move_anon_rmap(page, vma); + page->index = linear_page_index(vma, address); + page_ref_unfreeze(page, 1); + + return true; +} #ifdef CONFIG_MIGRATION void ksm_migrate_page(struct page *newpage, struct page *oldpage) { diff --git a/mm/memory.c b/mm/memory.c index feff48e1465a..2587d6b5b0b2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3114,25 +3114,50 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) * not dirty accountable. */ if (PageAnon(vmf->page)) { - struct page *page = vmf->page; - - /* PageKsm() doesn't necessarily raise the page refcount */ - if (PageKsm(page) || page_count(page) != 1) - goto copy; - if (!trylock_page(page)) - goto copy; - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) { - unlock_page(page); + int total_map_swapcount; + if (PageKsm(vmf->page) && (PageSwapCache(vmf->page) || + page_count(vmf->page) != 1)) goto copy; + if (!trylock_page(vmf->page)) { + get_page(vmf->page); + pte_unmap_unlock(vmf->pte, vmf->ptl); + lock_page(vmf->page); + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + if (!pte_same(*vmf->pte, vmf->orig_pte)) { + update_mmu_tlb(vma, vmf->address, vmf->pte); + unlock_page(vmf->page); + pte_unmap_unlock(vmf->pte, vmf->ptl); + put_page(vmf->page); + return 0; + } + put_page(vmf->page); } - /* - * Ok, we've got the only map reference, and the only - * page count reference, and the page is locked, - * it's dark out, and we're wearing sunglasses. Hit it. - */ - unlock_page(page); - wp_page_reuse(vmf); - return VM_FAULT_WRITE; + if (PageKsm(vmf->page)) { + bool reused = reuse_ksm_page(vmf->page, vmf->vma, + vmf->address); + unlock_page(vmf->page); + if (!reused) + goto copy; + wp_page_reuse(vmf); + return VM_FAULT_WRITE; + } + if (reuse_swap_page(vmf->page, &total_map_swapcount)) { + if (total_map_swapcount == 1) { + /* + * The page is all ours. Move it to + * our anon_vma so the rmap code will + * not search our parent or siblings. + * Protected against the rmap code by + * the page lock. + */ + page_move_anon_rmap(vmf->page, vma); + } + unlock_page(vmf->page); + wp_page_reuse(vmf); + return VM_FAULT_WRITE; + } + unlock_page(vmf->page); } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))) { return wp_page_shared(vmf);