Re: [PATCH v2] mm: fix aio performance regression for database caused by THP (backport to stable)

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Fri, 18 Oct 2013 18:06:02 +0200

Hi,

Sorry for the self-followup but looking further into this, there are
more subtle troubles with this patch that I think are worth discussing
(not just the PageHuge check to skip the __page_cache_release).

> > -	struct page *page_head = compound_trans_head(page);
> > +	struct page *page_head;
> > +
> > +	/*
> > +	 * If this is a hugetlbfs page it cannot be split under us.  Simply
> > +	 * increment refcount for the head page.
> > +	 */
> > +	if (PageHuge(page)) {
> > +		page_head = compound_head(page);
> > +		atomic_inc(&page_head->_count);
> > +		got = true;
> > +		goto out;
> > +	}

PageHuge:

int PageHuge(struct page *page) /* page for this example is a THP PageTail */
{
	compound_page_dtor *dtor;

	if (!PageCompound(page))
		return 0;

	page = compound_head(page);
	dtor = get_compound_page_dtor(page);

	return dtor == free_huge_page;
}

compound_head looks like this:

static inline struct page *compound_head(struct page *page)
{
	if (unlikely(PageTail(page)))
		========================== split_huge_page
		return page->first_page;
	return page;
}

This patch adds comments:

	/*
	 * hugetlbfs pages cannot be split from under us.  If this is a
	 * hugetlbfs page, check refcount on head page and release the page if
	 * the refcount becomes zero.
	 */

While this comment is correct for hugetlbfs, the harder problem is
when this is a THP tail page that can be splitted... and it looks like
that wasn't taken into consideration.

It looks stable on x86 but:

1) this is just a micro-optmization and as such it should be
   microoptimzied, and it isn't.

   It's repeating compound_head twice for hugetlbfs and it's
   running compound_head+compound_trans_head for THP when a single
   one is needed in both cases.

2) it doesn't remove compound_trans_head from the rest of the kernel.

   With this patch you just made compound_trans_head meaningless.  And
   you added an implicit dependency on split_huge_page that cannot
   ever touch page->first_page, see the "========= split_huge_page" above.

3) if you go ahead with 2), we'll never be able to propagate
   page->private from head to tail in split_huge_page, a thing
   potentially required by tmpfs (Kirill?), or the kernel would crash
   in put_page with this patch applied (dereferencing a dangling
   pointer because tail_page->first_page is overwritten by
   page_head->private).

4) it doesn't verify that the THP page size is never bigger than the
   smallest hugetlbfs page size.

   Such a condition would lead to silent memory corruption because it
   assumes it deals with a valid hugetlbfs head without pinning it and
   without rechecking pagetail, before taking the hugetlbfs-only path.

   See for example PageSlab check, it does exactly this. PageSlab
   min order is 1, so smaller than the THP page order.

None of the above seems to be fatal in practice on x86, but I doubt it
is worth it for -stable, and for upstream now you need to decide if we
drop all compound_trans_head.

If we don't drop compound_trans_head to allow propagating
page->private from head to tail in the future, it should be possible
to fix this patch to use a single compound_trans_head, so at the same
time we optimize away the ugly inefficiency I mentioned in 1).

Then if we want to remove the flakey assumption in 4) we should move
this together with the PageSlab check. And to move it along with
PageSlab I think you only need to implement a PageHeadHuge check (that
is required for 1) too).

If you insist to skip get_page_unless_zero (not just the
compound_lock) then I'd recommend at least to add a boot time check so
that if HPAGE_PMD_SHIFT > HPAGE_SHIFT THP disengage itself at boot and
cannot be ever enabled. So if some arch has multiple huge page sizes
and decides to make the THP page size not the smallest one available
in hardware, at least it will notice it's not a supported config in a
graceful way.
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html