On Wed, Apr 01, 2015 at 12:08:35PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx> writes:
>
> > Current split_huge_page() combines two operations: splitting PMDs into
> > tables of PTEs and splitting the underlying compound page. This patch
> > changes the split_huge_pmd() implementation to split the given PMD
> > without splitting other PMDs this page is mapped with or the underlying
> > compound page.
> >
> > In order to do this we have to get rid of tail page refcounting, which
> > uses _mapcount of tail pages. Tail page refcounting is needed to be able
> > to split a THP page at any point: we always know which of the tail pages
> > is pinned (i.e. by get_user_pages()) and can distribute the page count
> > correctly.
> >
> > We can avoid this by allowing split_huge_page() to fail if the compound
> > page is pinned. This patch removes all infrastructure for tail page
> > refcounting and makes split_huge_page() always return -EBUSY. All
> > split_huge_page() users already know how to handle its failure. A
> > proper implementation will be added later.
> >
> > Without tail page refcounting, the implementation of split_huge_pmd()
> > is pretty straightforward.
>
> With this we now have a pte-mapped part of a compound page. Now the
> generic gup implementation does:
>
> gup_pte_range()
> 	ptem = ptep = pte_offset_map(&pmd, addr);
> 	do {
> 		...
> 		if (!page_cache_get_speculative(page))
> 			goto pte_unmap;
> 		...
> 	}
>
> That page_cache_get_speculative will fail in our case because it does
> if (unlikely(!get_page_unless_zero(page))) on a tail page.

?? IIUC, something as simple as the patch below should work fine with
migration entries. The reason I'm talking about migration entries is that
with the new refcounting, split_huge_page() breaks this generic fast GUP
invariant:

 * *) THP splits will broadcast an IPI, this can be achieved by overriding
 *    pmdp_splitting_flush.

We don't necessarily trigger an IPI during split. The page can be mapped
only with ptes by split time.
That's fine for migration entries, since we re-check the pte value after
taking the pin. But it seems we don't have anything in place for the
compound_lock case. Hm.

If I can't find any way to make it work with compound_lock, I will need to
implement the new split_huge_page() on migration entries without the
intermediate step with compound_lock.

Any comments?

diff --git a/mm/gup.c b/mm/gup.c
index d58af0785d24..b45edb8e6455 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1047,7 +1047,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		 * for an example see gup_get_pte in arch/x86/mm/gup.c
 		 */
 		pte_t pte = READ_ONCE(*ptep);
-		struct page *page;
+		struct page *head, *page;
 
 		/*
 		 * Similar to the PMD case below, NUMA hinting must take slow
@@ -1059,15 +1059,17 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+		head = compound_head(page);
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(head))
 			goto pte_unmap;
 
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-			put_page(page);
+			put_page(head);
 			goto pte_unmap;
 		}
 
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;

-- 
 Kirill A. Shutemov