On Wed, Nov 16, 2016 at 10:53:39AM -0800, Mike Kravetz wrote:
> I was running some tests with error injection to exercise the error
> path and noticed the reservation leaks as the system eventually ran
> out of huge pages. I need to think about it some more, but we may
> want to at least do something like the following before put_page (with
> a BIG comment):
>
>	if (unlikely(PagePrivate(page)))
>		ClearPagePrivate(page);
>
> That would at least keep the global reservation count from increasing.
>
> Let me look into that.

However, what happens if the old vma got munmapped and a new compatible
vma was instantiated and passes revalidation fine? Does the reserved
page of the old vma then go to a different vma?

This reservation code is complex and has lots of special cases anyway,
but the main concern at this point is the subpool_vma(vma) pointer
stored with set_page_private being released by
hugetlb_vm_op_close->unlock_or_release_subpool.

Aside from the accounting, what about the page_private(page) subpool?
It's used by free_huge_page, which would get out of sync with vma/inode
destruction if we release the mmap_sem:

	struct hugepage_subpool *spool =
		(struct hugepage_subpool *)page_private(page);

I think in the revalidation code we need to check whether
page_private(page) still matches subpool_vma(vma); if it doesn't and
it's a stale pointer, we can't even call put_page before fixing up
page_private first.
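
To make that concrete, a rough and untested sketch of such a check (the
helper name is invented here, and the locking around the hugetlb fault
path is omitted):

	/*
	 * Untested sketch, helper name invented: after re-taking the
	 * mmap_sem and revalidating the vma, verify the page still
	 * points to the subpool of the current vma before the error
	 * path drops its reference.
	 */
	static bool hugetlb_page_subpool_matches(struct vm_area_struct *vma,
						 struct page *page)
	{
		return (struct hugepage_subpool *)page_private(page) ==
			subpool_vma(vma);
	}

	/* in the UFFDIO_COPY error path, after revalidation: */
	if (!hugetlb_page_subpool_matches(dst_vma, page)) {
		/*
		 * Stale subpool pointer: clear it (and the restore
		 * reserve flag, as suggested above) so free_huge_page
		 * can't dereference a freed subpool.
		 */
		set_page_private(page, 0);
		if (unlikely(PagePrivate(page)))
			ClearPagePrivate(page);
	}
	put_page(page);

Whether clearing is enough, or the reservation should instead be
transferred to the new vma, is exactly the accounting question above.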

The other way to solve this is not to release the mmap_sem at all and
in the slow path call __get_user_pages(nonblocking=NULL). That's slower
than using the CPU TLB to reach the source data, and it would also
prevent handling a userfault in the source address of UFFDIO_COPY,
because a userfault in the source would require the mmap_sem to be
released (i.e. it'd require get_user_pages_fast, which would again
recurse on the mmap_sem, and in that case we could as well stick to the
current nonblocking copy-user). We currently don't handle nesting with
non-cooperative userfaultfd anyway, so it'd be OK for now not to
release the mmap_sem while copying in UFFDIO_COPY.

Off topic here, but while reading this code I also noticed
free_huge_page is wasting CPU, and then noticed other places wasting
CPU the same way. Hence the patch below.

From 9f3966c5bbf88cb8f702393d6a78abf1b8f960f9 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Date: Thu, 17 Nov 2016 15:28:20 +0100
Subject: [PATCH 1/1] hugetlbfs: use non atomic ops when the page is private

After the page has been freed, it's fully private and no other CPU can
manipulate the page structure anymore (other than get_page_unless_zero
from speculative lookups, but those will fail because of the zero
refcount). The same is true when the page has been newly allocated. So
we can use faster non atomic ops for those cases.

Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
---
 mm/hugetlb.c | 44 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 022750d..7c422c1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1216,16 +1216,44 @@ bool page_huge_active(struct page *page)
 }
 
 /* never called for tail page */
+static __always_inline void ____set_page_huge_active(struct page *page,
+						     bool atomic)
+{
+	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
+	if (atomic)
+		SetPagePrivate(&page[1]);
+	else
+		__SetPagePrivate(&page[1]);
+}
+
 static void set_page_huge_active(struct page *page)
 {
+	____set_page_huge_active(page, true);
+}
+
+static void __set_page_huge_active(struct page *page)
+{
+	____set_page_huge_active(page, false);
+}
+
+static __always_inline void ____clear_page_huge_active(struct page *page,
+						       bool atomic)
+{
 	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-	SetPagePrivate(&page[1]);
+	if (atomic)
+		ClearPagePrivate(&page[1]);
+	else
+		__ClearPagePrivate(&page[1]);
 }
 
 static void clear_page_huge_active(struct page *page)
 {
-	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-	ClearPagePrivate(&page[1]);
+	____clear_page_huge_active(page, true);
+}
+
+static void __clear_page_huge_active(struct page *page)
+{
+	____clear_page_huge_active(page, false);
 }
 
 void free_huge_page(struct page *page)
@@ -1245,7 +1273,7 @@ void free_huge_page(struct page *page)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(page_mapcount(page), page);
 	restore_reserve = PagePrivate(page);
-	ClearPagePrivate(page);
+	__ClearPagePrivate(page);
 
 	/*
 	 * A return code of zero implies that the subpool will be under its
@@ -1256,7 +1284,7 @@ void free_huge_page(struct page *page)
 		restore_reserve = true;
 
 	spin_lock(&hugetlb_lock);
-	clear_page_huge_active(page);
+	__clear_page_huge_active(page);
 	hugetlb_cgroup_uncharge_page(hstate_index(h),
 				     pages_per_huge_page(h), page);
 	if (restore_reserve)
@@ -3534,7 +3562,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	copy_user_huge_page(new_page, old_page, address, vma,
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
-	set_page_huge_active(new_page);
+	__set_page_huge_active(new_page);
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
@@ -3697,7 +3725,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	clear_huge_page(page, address, pages_per_huge_page(h));
 	__SetPageUptodate(page);
-	set_page_huge_active(page);
+	__set_page_huge_active(page);
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		int err = huge_add_to_page_cache(page, mapping, idx);
@@ -4000,7 +4028,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * the set_pte_at() write.
 	 */
 	__SetPageUptodate(page);
-	set_page_huge_active(page);
+	__set_page_huge_active(page);
 
 	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
 	spin_lock(ptl);
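
For reference, the only difference between the two flavors is the bitop
the page-flags macros expand to; roughly like this (a simplified
illustration, not the literal include/linux/page-flags.h expansion):

	/* Simplified illustration of what the accessors boil down to. */
	static inline void SetPagePrivate(struct page *page)
	{
		set_bit(PG_private, &page->flags);	/* atomic RMW, lock prefix on x86 */
	}

	static inline void __SetPagePrivate(struct page *page)
	{
		__set_bit(PG_private, &page->flags);	/* plain, non-atomic RMW */
	}

The __ variants are only safe while no other CPU can concurrently
modify page->flags, which is exactly the freed or newly-allocated
window the patch restricts them to.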

> I will take a look, and 'may' have a test that can be modified for this.

Overnight stress of postcopy live migration over hugetlbfs passed
without a single glitch with the patch applied, so it's tested now.
It'd still be good to add an O_DIRECT test to the selftest.

The previous issue with the mmap_sem release, the accounting, and the
potential subpool use after free only concerns malicious apps; it'd be
impossible to reproduce with qemu or the current selftest, but we have
to take care of it before I can resubmit for upstream.

Thanks,
Andrea