Re: bug: data corruption introduced by commit 83d116c53058 ("mm: fix double page fault on arm64 if PTE_AF is cleared")

On Mon, Feb 10, 2020 at 05:51:50PM -0500, Jeff Moyer wrote:
> Hi,
> 
> While running xfstests test generic/437, I noticed that the following
> WARN_ON_ONCE inside cow_user_page() was triggered:
> 
> 	/*
> 	 * This really shouldn't fail, because the page is there
> 	 * in the page tables. But it might just be unreadable,
> 	 * in which case we just give up and fill the result with
> 	 * zeroes.
> 	 */
> 	if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
> 		/*
> 		 * Give a warn in case there can be some obscure
> 		 * use-case
> 		 */
> 		WARN_ON_ONCE(1);
> 		clear_page(kaddr);
> 	}
> 
> Just clearing the page, in this case, will result in data corruption.  I
> think the assumption that the copy fails because the memory is
> inaccessible may be wrong.  In this instance, it may be the other thread
> issuing the madvise call?
> 
> I reverted the commit in question, and the data corruption is gone.
> 
> Below is the (ppc64) stack trace (this will reproduce on x86 as well).
> I've attached the reproducer, which is a modified version of the
> xfs test.  You'll need to setup a file system on pmem, and mount it with
> -o dax.  Then issue "./t_mmap_cow_race /mnt/pmem/foo".
> 
> Any help tracking this down is appreciated.

My guess is that MADV_DONTNEED gets the page unmapped under you, and
__copy_from_user_inatomic() sees an empty PTE instead of the populated PTE
it expects.
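
The race would look roughly like this (hypothetical minimal version for
illustration, not your attached t_mmap_cow_race; error handling elided):

/*
 * One thread keeps write-faulting a MAP_PRIVATE mapping of the dax
 * file, forcing CoW, while the other keeps zapping the PTEs with
 * MADV_DONTNEED.
 */
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>

#define LEN	(1UL << 20)

static char *map;

static void *writer(void *arg)
{
	/* Each write fault triggers CoW; cow_user_page() copies from
	 * the file page, which may be unmapped by then. */
	for (unsigned long i = 0; i < 100000; i++)
		map[(i * 4096) % LEN] = 1;
	return NULL;
}

static void *zapper(void *arg)
{
	/* Concurrently unmap the PTEs under the copier. */
	for (unsigned long i = 0; i < 100000; i++)
		madvise(map, LEN, MADV_DONTNEED);
	return NULL;
}

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDWR);
	pthread_t t1, t2;

	map = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	pthread_create(&t1, NULL, writer, NULL);
	pthread_create(&t2, NULL, zapper, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}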

Below is my completely untested attempt to fix it.

It is going to hurt performance in the common case, but it should be good
enough to test my idea.

The real solution would be to retry __copy_from_user_inatomic() under the
PTL if the first attempt fails. I expect it to be ugly.
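
Something along these lines, maybe (completely untested sketch: the
mkyoung handling and the cleanup path are elided, and "out" stands for
cow_user_page()'s normal exit):

	/* Try the lockless copy first; only take the PTL if it fails. */
	if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
		if (!pte_same(*vmf->pte, vmf->orig_pte)) {
			/* The PTE changed under us: let the fault retry. */
			pte_unmap_unlock(vmf->pte, vmf->ptl);
			ret = false;
			goto out;
		}

		/*
		 * The page may have been mapped back since the first
		 * attempt; try again, now stable under the PTL.
		 */
		if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
			/* Still unreadable: keep the warning for this case. */
			WARN_ON_ONCE(1);
			clear_page(kaddr);
		}
		pte_unmap_unlock(vmf->pte, vmf->ptl);
	}

Anyway, here is the simple version that always takes the PTL, just to
test the theory: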

diff --git a/mm/memory.c b/mm/memory.c
index 0bccc622e482..362a791f47fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2257,7 +2257,6 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
 	bool ret;
 	void *kaddr;
 	void __user *uaddr;
-	bool force_mkyoung;
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr = vmf->address;
@@ -2278,27 +2277,24 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
 	kaddr = kmap_atomic(dst);
 	uaddr = (void __user *)(addr & PAGE_MASK);
 
+	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+	if (!pte_same(*vmf->pte, vmf->orig_pte)) {
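+		/*
+		 * Other thread has already handled the fault
+		 * and we don't need to do anything. If it's
+		 * not the case, the fault will be triggered
+		 * again on the same address.
+		 */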
+		ret = false;
+		goto pte_unlock;
+	}
+
 	/*
 	 * On architectures with software "accessed" bits, we would
 	 * take a double page fault, so mark it accessed here.
 	 */
-	force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf->orig_pte);
-	if (force_mkyoung) {
-		pte_t entry;
-
-		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
-		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
-			/*
-			 * Other thread has already handled the fault
-			 * and we don't need to do anything. If it's
-			 * not the case, the fault will be triggered
-			 * again on the same address.
-			 */
-			ret = false;
-			goto pte_unlock;
-		}
-
-		entry = pte_mkyoung(vmf->orig_pte);
+	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+		pte_t entry = pte_mkyoung(vmf->orig_pte);
 		if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
 			update_mmu_cache(vma, addr, vmf->pte);
 	}
@@ -2321,8 +2317,7 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
 	ret = true;
 
 pte_unlock:
-	if (force_mkyoung)
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	kunmap_atomic(kaddr);
 	flush_dcache_page(dst);
 
-- 
 Kirill A. Shutemov



