Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings


 



On 02/16/2017 10:41 AM, Andrea Arcangeli wrote:
> On Wed, Feb 15, 2017 at 01:46:50PM -0800, Mike Kravetz wrote:
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index d0d1d08..41f6c51 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4029,6 +4029,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>>  	__SetPageUptodate(page);
>>  	set_page_huge_active(page);
>>  
>> +	/*
>> +	 * If shared, add to page cache
>> +	 */
>> +	if (dst_vma->vm_flags & VM_SHARED) {
> 
> Minor nitpick, this could be a:
> 
>       int vm_shared = dst_vma->vm_flags & VM_SHARED;
> 
> (int is faster than bool here, as VM_SHARED won't have to be converted into 0|1)
> 
>> @@ -386,7 +413,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>>  		goto out_unlock;
>>  
>>  	err = -EINVAL;
>> -	if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED)
>> +	if (!vma_is_shmem(dst_vma) && !is_vm_hugetlb_page(dst_vma) &&
>> +	    dst_vma->vm_flags & VM_SHARED)
>>  		goto out_unlock;
> 
> Other minor nitpick, this could have been faster as:
> 
>      if (vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED)
> 
> Thinking twice, the only case we need to rule out is shmem_zero_setup
> (it's not that anon vmas can really be VM_SHARED, or they wouldn't be
> anon vmas in the first place), so even the above is superfluous because
> shmem_zero_setup does:
> 
> 	vma->vm_ops = &shmem_vm_ops;
> 
> So I would turn it into:
> 
> 
>      /*
>       * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
>       * it will overwrite vm_ops, so vma_is_anonymous must return false.
>       */
>      if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED))
>  		goto out_unlock;
> 
> 
> Reviewed-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> ---

Thanks Andrea, I incorporated your suggestions into a new version of the patch.
While converting the (dst_vma->vm_flags & VM_SHARED) checks to int variables, I
noticed an issue in the error path of __mcopy_atomic_hugetlb().

>                */
> -             ClearPagePrivate(page);
> +             if (dst_vma->vm_flags & VM_SHARED)
> +                     SetPagePrivate(page);
> +             else
> +                     ClearPagePrivate(page);
>               put_page(page);

We cannot use dst_vma here, as it may be different from the vma for which the
page was originally allocated, or even NULL.  Remember, we may drop mmap_sem
and look up dst_vma again.  Therefore, we need to save the value of
(dst_vma->vm_flags & VM_SHARED) for the vma that was used when the page was
allocated, and test that saved value in the error path.  This change, as well
as your suggestions, is incorporated in the patch below.
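For clarity, the error-path part of the change boils down to the following
(a condensed sketch of just the relevant fragments, not the complete function;
the real code is in the patch):

	/* captured while dst_vma is still the vma the page is allocated for */
	int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
	...
	/* error path: mmap_sem was retaken, dst_vma may differ or be NULL */
	if (page) {
		if (vm_alloc_shared)	/* not dst_vma->vm_flags */
			SetPagePrivate(page);
		else
			ClearPagePrivate(page);
		put_page(page);
	}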

>From e13d3b4ab24f2c423a8b0d645f0ea715c285880d Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Date: Wed, 15 Feb 2017 13:02:41 -0800
Subject: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared
 mappings

When userfaultfd hugetlbfs support was originally added, it followed
the pattern of anon mappings and did not support any vmas marked
VM_SHARED.  As such, support was only added for private mappings.

Remove this limitation and support shared mappings.  The primary
functional change required is adding pages to the page cache.  More
subtle changes are required for huge page reservation handling in
error paths.  A lengthy comment in the code describes the reservation
handling.

Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
Cc: "Dr. David Alan Gilbert" <dgilbert@xxxxxxxxxx>
Cc: Hillf Danton <hillf.zj@xxxxxxxxxxxxxxx>
Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: Pavel Emelyanov <xemul@xxxxxxxxxxxxx>
---
 mm/hugetlb.c     | 26 ++++++++++++++++++--
 mm/userfaultfd.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 82 insertions(+), 18 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d0d1d08..2e0e815 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3993,6 +3993,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    unsigned long src_addr,
 			    struct page **pagep)
 {
+	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	struct hstate *h = hstate_vma(dst_vma);
 	pte_t _dst_pte;
 	spinlock_t *ptl;
@@ -4029,6 +4030,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	__SetPageUptodate(page);
 	set_page_huge_active(page);
 
+	/*
+	 * If shared, add to page cache
+	 */
+	if (vm_shared) {
+		struct address_space *mapping = dst_vma->vm_file->f_mapping;
+		pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+
+		ret = huge_add_to_page_cache(page, mapping, idx);
+		if (ret)
+			goto out_release_nounlock;
+	}
+
 	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
 	spin_lock(ptl);
 
@@ -4036,8 +4049,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (!huge_pte_none(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;
 
-	ClearPagePrivate(page);
-	hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
+	if (vm_shared) {
+		page_dup_rmap(page, true);
+	} else {
+		ClearPagePrivate(page);
+		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
+	}
 
 	_dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE);
 	if (dst_vma->vm_flags & VM_WRITE)
@@ -4054,11 +4071,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
 
 	spin_unlock(ptl);
+	if (vm_shared)
+		unlock_page(page);
 	ret = 0;
 out:
 	return ret;
 out_release_unlock:
 	spin_unlock(ptl);
+out_release_nounlock:
+	if (vm_shared)
+		unlock_page(page);
 	put_page(page);
 	goto out;
 }
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index b861cf9..2fc0e76 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -154,6 +154,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					      unsigned long len,
 					      bool zeropage)
 {
+	int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
+	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	ssize_t err;
 	pte_t *dst_pte;
 	unsigned long src_addr, dst_addr;
@@ -211,14 +213,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 			goto out_unlock;
 
 		/*
-		 * Make sure the vma is not shared, that the remaining dst
-		 * range is both valid and fully within a single existing vma.
+		 * Make sure the remaining dst range is both valid and
+		 * fully within a single existing vma.
 		 */
-		if (dst_vma->vm_flags & VM_SHARED)
-			goto out_unlock;
 		if (dst_start < dst_vma->vm_start ||
 		    dst_start + len > dst_vma->vm_end)
 			goto out_unlock;
+
+		vm_shared = dst_vma->vm_flags & VM_SHARED;
 	}
 
 	if (WARN_ON(dst_addr & (vma_hpagesize - 1) ||
@@ -226,11 +228,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		goto out_unlock;
 
 	/*
-	 * Ensure the dst_vma has a anon_vma.
+	 * If not shared, ensure the dst_vma has an anon_vma.
 	 */
 	err = -ENOMEM;
-	if (unlikely(anon_vma_prepare(dst_vma)))
-		goto out_unlock;
+	if (!vm_shared) {
+		if (unlikely(anon_vma_prepare(dst_vma)))
+			goto out_unlock;
+	}
 
 	h = hstate_vma(dst_vma);
 
@@ -267,6 +271,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 						dst_addr, src_addr, &page);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		vm_alloc_shared = vm_shared;
 
 		cond_resched();
 
@@ -306,18 +311,49 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	if (page) {
 		/*
 		 * We encountered an error and are about to free a newly
-		 * allocated huge page.  It is possible that there was a
-		 * reservation associated with the page that has been
-		 * consumed.  See the routine restore_reserve_on_error
-		 * for details.  Unfortunately, we can not call
-		 * restore_reserve_on_error now as it would require holding
-		 * mmap_sem.  Clear the PagePrivate flag so that the global
+		 * allocated huge page.
+		 *
+		 * Reservation handling is very subtle, and is different for
+		 * private and shared mappings.  See the routine
+		 * restore_reserve_on_error for details.  Unfortunately, we
+		 * can not call restore_reserve_on_error now as it would
+		 * require holding mmap_sem.
+		 *
+		 * If a reservation for the page existed in the reservation
+		 * map of a private mapping, the map was modified to indicate
+		 * the reservation was consumed when the page was allocated.
+		 * We clear the PagePrivate flag now so that the global
 		 * reserve count will not be incremented in free_huge_page.
 		 * The reservation map will still indicate the reservation
 		 * was consumed and possibly prevent later page allocation.
-		 * This is better than leaking a global reservation.
+		 * This is better than leaking a global reservation.  If no
+		 * reservation existed, it is still safe to clear PagePrivate
+		 * as no adjustments to reservation counts were made during
+		 * allocation.
+		 *
+		 * The reservation map for shared mappings indicates which
+		 * pages have reservations.  When a huge page is allocated
+		 * for an address with a reservation, no change is made to
+		 * the reserve map.  In this case PagePrivate will be set
+		 * to indicate that the global reservation count should be
+		 * incremented when the page is freed.  This is the desired
+		 * behavior.  However, when a huge page is allocated for an
+		 * address without a reservation a reservation entry is added
+		 * to the reservation map, and PagePrivate will not be set.
+		 * When the page is freed, the global reserve count will NOT
+		 * be incremented and it will appear as though we have leaked
+		 * a reserved page.  In this case, set PagePrivate so that the
+		 * global reserve count will be incremented to match the
+		 * reservation map entry which was created.
+		 *
+		 * Note that vm_alloc_shared is based on the flags of the vma
+		 * for which the page was originally allocated.  dst_vma could
+		 * be different or NULL on error.
 		 */
-		ClearPagePrivate(page);
+		if (vm_alloc_shared)
+			SetPagePrivate(page);
+		else
+			ClearPagePrivate(page);
 		put_page(page);
 	}
 	BUG_ON(copied < 0);
@@ -386,8 +422,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		goto out_unlock;
 
 	err = -EINVAL;
-	if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED)
+	/*
+	 * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
+	 * it will overwrite vm_ops, so vma_is_anonymous must return false.
+	 */
+	if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
+	    dst_vma->vm_flags & VM_SHARED))
 		goto out_unlock;
+
 	if (dst_start < dst_vma->vm_start ||
 	    dst_start + len > dst_vma->vm_end)
 		goto out_unlock;
-- 
2.7.4



