On Wed, Nov 16, 2016 at 10:53:39AM -0800, Mike Kravetz wrote:
> I was running some tests with error injection to exercise the error
> path and noticed the reservation leaks as the system eventually ran
> out of huge pages. I need to think about it some more, but we may
> want to at least do something like the following before put_page (with
> a BIG comment):
>
>	if (unlikely(PagePrivate(page)))
>		ClearPagePrivate(page);
>
> That would at least keep the global reservation count from increasing.
>
> Let me look into that.

However, what happens if the old vma got munmapped and a new compatible
vma was instantiated and passes revalidation fine? Does the reserved
page of the old vma then go to a different vma?

This reservation code is complex and has lots of special cases anyway,
but the main concern at this point is the subpool_vma(vma) pointer
stored with set_page_private being released by
hugetlb_vm_op_close->unlock_or_release_subpool.

Aside from the accounting, what about the page_private(page) subpool?
It's used by free_huge_page, which would get out of sync with vma/inode
destruction if we release the mmap_sem:

	struct hugepage_subpool *spool =
		(struct hugepage_subpool *)page_private(page);

I think in the revalidation code we need to check whether
page_private(page) still matches subpool_vma(vma); if it doesn't and
it's a stale pointer, we can't even call put_page before fixing up
page_private first.
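
To make that concrete, a rough and untested sketch of such a check (the
helper name is invented here, and the locking around the hugetlb fault
path is omitted):

	/*
	 * Untested sketch, helper name invented: after re-taking the
	 * mmap_sem and revalidating the vma, verify the page still
	 * points to the subpool of the current vma before the error
	 * path drops its reference.
	 */
	static bool hugetlb_page_subpool_matches(struct vm_area_struct *vma,
						 struct page *page)
	{
		return (struct hugepage_subpool *)page_private(page) ==
			subpool_vma(vma);
	}

	/* in the UFFDIO_COPY error path, after revalidation: */
	if (!hugetlb_page_subpool_matches(dst_vma, page)) {
		/*
		 * Stale subpool pointer: clear it (and the restore
		 * reserve flag, as suggested above) so free_huge_page
		 * can't dereference a freed subpool.
		 */
		set_page_private(page, 0);
		if (unlikely(PagePrivate(page)))
			ClearPagePrivate(page);
	}
	put_page(page);

Whether clearing is enough, or the reservation should instead be
transferred to the new vma, is exactly the accounting question above.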

The other way to solve this is not to release the mmap_sem at all and
in the slow path call __get_user_pages(nonblocking=NULL). That's slower
than using the CPU TLB to reach the source data, and it would also
prevent handling a userfault in the source address of UFFDIO_COPY,
because a userfault in the source would require the mmap_sem to be
released (i.e. it'd require get_user_pages_fast, which would again
recurse on the mmap_sem, and in that case we could as well stick to the
current nonblocking copy-user). We currently don't handle nesting with
non-cooperative userfaultfd anyway, so it'd be OK for now not to
release the mmap_sem while copying in UFFDIO_COPY.

Off topic here, but while reading this code I also noticed
free_huge_page is wasting CPU, and then noticed other places wasting
CPU the same way. Hence the patch below.

From 9f3966c5bbf88cb8f702393d6a78abf1b8f960f9 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Date: Thu, 17 Nov 2016 15:28:20 +0100
Subject: [PATCH 1/1] hugetlbfs: use non atomic ops when the page is private

After the page has been freed, it's fully private and no other CPU can
manipulate the page structure anymore (other than get_page_unless_zero
from speculative lookups, but those will fail because of the zero
refcount). The same is true when the page has been newly allocated. So
we can use faster non atomic ops for those cases.

Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
---
 mm/hugetlb.c | 44 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 022750d..7c422c1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1216,16 +1216,44 @@ bool page_huge_active(struct page *page)
 }
 
 /* never called for tail page */
+static __always_inline void ____set_page_huge_active(struct page *page,
+						     bool atomic)
+{
+	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
+	if (atomic)
+		SetPagePrivate(&page[1]);
+	else
+		__SetPagePrivate(&page[1]);
+}
+
 static void set_page_huge_active(struct page *page)
 {
+	____set_page_huge_active(page, true);
+}
+
+static void __set_page_huge_active(struct page *page)
+{
+	____set_page_huge_active(page, false);
+}
+
+static __always_inline void ____clear_page_huge_active(struct page *page,
+						       bool atomic)
+{
 	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-	SetPagePrivate(&page[1]);
+	if (atomic)
+		ClearPagePrivate(&page[1]);
+	else
+		__ClearPagePrivate(&page[1]);
 }
 
 static void clear_page_huge_active(struct page *page)
 {
-	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-	ClearPagePrivate(&page[1]);
+	____clear_page_huge_active(page, true);
+}
+
+static void __clear_page_huge_active(struct page *page)
+{
+	____clear_page_huge_active(page, false);
 }
 
 void free_huge_page(struct page *page)
@@ -1245,7 +1273,7 @@ void free_huge_page(struct page *page)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(page_mapcount(page), page);
 	restore_reserve = PagePrivate(page);
-	ClearPagePrivate(page);
+	__ClearPagePrivate(page);
 
 	/*
 	 * A return code of zero implies that the subpool will be under its
@@ -1256,7 +1284,7 @@ void free_huge_page(struct page *page)
 		restore_reserve = true;
 
 	spin_lock(&hugetlb_lock);
-	clear_page_huge_active(page);
+	__clear_page_huge_active(page);
 	hugetlb_cgroup_uncharge_page(hstate_index(h),
 				     pages_per_huge_page(h), page);
 	if (restore_reserve)
@@ -3534,7 +3562,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	copy_user_huge_page(new_page, old_page, address, vma,
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
-	set_page_huge_active(new_page);
+	__set_page_huge_active(new_page);
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
@@ -3697,7 +3725,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	clear_huge_page(page, address, pages_per_huge_page(h));
 	__SetPageUptodate(page);
-	set_page_huge_active(page);
+	__set_page_huge_active(page);
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		int err = huge_add_to_page_cache(page, mapping, idx);
@@ -4000,7 +4028,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * the set_pte_at() write.
 	 */
 	__SetPageUptodate(page);
-	set_page_huge_active(page);
+	__set_page_huge_active(page);
 
 	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
 	spin_lock(ptl);
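
For reference, the only difference between the two flavors is the bitop
the page-flags macros expand to; roughly like this (a simplified
illustration, not the literal include/linux/page-flags.h expansion):

	/* Simplified illustration of what the accessors boil down to. */
	static inline void SetPagePrivate(struct page *page)
	{
		set_bit(PG_private, &page->flags);	/* atomic RMW, lock prefix on x86 */
	}

	static inline void __SetPagePrivate(struct page *page)
	{
		__set_bit(PG_private, &page->flags);	/* plain, non-atomic RMW */
	}

The __ variants are only safe while no other CPU can concurrently
modify page->flags, which is exactly the freed or newly-allocated
window the patch restricts them to.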

> I will take a look, and 'may' have a test that can be modified for this.

Overnight stress of postcopy live migration over hugetlbfs passed
without a single glitch with the patch applied, so it's tested now.
It'd still be good to add an O_DIRECT test to the selftest.

The previous issue with the mmap_sem release, the accounting, and the
potential subpool use after free only concerns malicious apps; it'd be
impossible to reproduce with qemu or the current selftest, but we have
to take care of it before I can resubmit for upstream.

Thanks,
Andrea