Re: [PATCH v2] mm: fix aio performance regression for database caused by THP (backport to stable)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 10/18/2013 10:06 AM, Andrea Arcangeli wrote:
Hi,

Sorry for the self-followup but looking further into this, there are
more subtle troubles with this patch that I think are worth discussing
(not just the PageHuge check to skip the __page_cache_release).

-	struct page *page_head = compound_trans_head(page);
+	struct page *page_head;
+
+	/*
+	 * If this is a hugetlbfs page it cannot be split under us.  Simply
+	 * increment refcount for the head page.
+	 */
+	if (PageHuge(page)) {
+		page_head = compound_head(page);
+		atomic_inc(&page_head->_count);
+		got = true;
+		goto out;
+	}

PageHuge:

int PageHuge(struct page *page) /* page for this example is a THP PageTail */
{
	compound_page_dtor *dtor;

	if (!PageCompound(page))
		return 0;

	page = compound_head(page);
	dtor = get_compound_page_dtor(page);
	

	return dtor == free_huge_page;
}


compound_head looks like this:

static inline struct page *compound_head(struct page *page)
{
	if (unlikely(PageTail(page)))
		========================== split_huge_page
		return page->first_page;
	return page;
}

Hi Andrea,

Sure, this can happen but wouldn't this case be caught by the last line in PageHuge() - "return dtor == free_huge_page;". This conditional should return false for THP PageTail, right?


This patch adds comments:

	/*
	 * hugetlbfs pages cannot be split from under us.  If this is a
	 * hugetlbfs page, check refcount on head page and release the page if
	 * the refcount becomes zero.
	 */

While this comment is correct for hugetlbfs, the harder problem is
when this is a THP tail page that can be splitted... and it looks like
that wasn't taken into consideration.

It looks stable on x86 but:

1) this is just a micro-optmization and as such it should be
    microoptimzied, and it isn't.

    It's repeating compound_head twice for hugetlbfs and it's
    running compound_head+compound_trans_head for THP when a single
    one is needed in both cases.

I see. PageHuge() changes the pointer it is passed to point to the head page. Calling compound_head on the head page is superfluous even though it doesn't do any harm and I can easily remove the call to compound_head() in the patch, but was it really intentional in PageHuge() to change the incoming pointer? That is a significant side effect of a routine that on the surface looks like a simple check that would return true or false.

Calling PageTail(page) after having called PageHuge(page) is pointless since "page" was already changed to point to the head page. So is the error in PageHuge(page) changing the pointer, or is the error in usage of PageHuge()?


2) it doesn't remove compound_trans_head from the rest of the kernel.

    With this patch you just made compound_trans_head meaningless.  And
    you added an implicit dependency on split_huge_page that cannot
    ever touch page->first_page, see the "========= split_huge_page" above.

I think PageHuge() already takes this into account by checking the dtor pointer. So I think this is not a problem. I do no intend to make compound_trans_head meaningless but I see the problem with calling it after having called PageHuge(), as I pointed out in my comments above.


3) if you go ahead with 2), we'll never be able to propagate
    page->private from head to tail in split_huge_page, a thing
    potentially required by tmpfs (Kirill?), or the kernel would crash
    in put_page with this patch applied (dereferencing a dangling
    pointer because tail_page->first_page is overwritten by
    page_head->private).

4) it doesn't verify that the THP page size is never bigger than the
    smallest hugetlbfs page size.

    Such a condition would lead to silent memory corruption because it
    assumes it deals with a valid hugetlbfs head without pinning it and
    without rechecking pagetail, before taking the hugetlbfs-only path.

    See for example PageSlab check, it does exactly this. PageSlab
    min order is 1, so smaller than the THP page order.

None of the above seems to be fatal in practice on x86, but I doubt it
is worth it for -stable, and for upstream now you need to decide if we
drop all compound_trans_head.

I/O performance regression on -stable makes this problem worth looking at in my opinion. Performance regression is very significant based upon tests from database folks. There is definitely room for improvement in the patch and your experience with this code is immensely helpful in doing that. compound_trans_head shows up as one of the big hitters in perf top when running database performance tests. My only intention in using PageHuge() was to determine definitively if the page I am working with is hugetlbfs page or not which PageHuge() seems to do except I missed that it has a significant side effect. This can be addressed by implementing PageHeadHuge() which is almost identical to PageHuge() without the side effect, as you suggested.


If we don't drop compound_trans_head to allow propagating
page->private from head to tail in the future, it should be possible
to fix this patch to use a single compound_trans_head, so at the same
time we optimize away the ugly inefficiency I mentioned in 1).

Then if we want to remove the flakey assumption in 4) we should move
this together with the PageSlab check. And to move it along with
PageSlab I think you only need to implement a PageHeadHuge check (that
is required for 1) too).

If you insist to skip get_page_unless_zero (not just the
compound_lock) then I'd recommend at least to add a boot time check so
that if HPAGE_PMD_SHIFT > HPAGE_SHIFT THP disengage itself at boot and
cannot be ever enabled. So if some arch has multiple huge page sizes
and decides to make the THP page size not the smallest one available
in hardware, at least it will notice it's not a supported config in a
graceful way.


It is reasonable to add get_page_unless_zero() to hugetlbfs only path.

Thanks for reviewing this patch.

--
Khalid

--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]