On 10/18/2013 10:06 AM, Andrea Arcangeli wrote:
Hi,
Sorry for the self-followup but looking further into this, there are
more subtle troubles with this patch that I think are worth discussing
(not just the PageHuge check to skip the __page_cache_release).
- struct page *page_head = compound_trans_head(page);
+ struct page *page_head;
+
+ /*
+ * If this is a hugetlbfs page it cannot be split under us. Simply
+ * increment refcount for the head page.
+ */
+ if (PageHuge(page)) {
+ page_head = compound_head(page);
+ atomic_inc(&page_head->_count);
+ got = true;
+ goto out;
+ }
PageHuge:
int PageHuge(struct page *page) /* page for this example is a THP PageTail */
{
compound_page_dtor *dtor;
if (!PageCompound(page))
return 0;
page = compound_head(page);
dtor = get_compound_page_dtor(page);
return dtor == free_huge_page;
}
compound_head looks like this:
static inline struct page *compound_head(struct page *page)
{
if (unlikely(PageTail(page)))
========================== split_huge_page
return page->first_page;
return page;
}
Hi Andrea,
Sure, this can happen but wouldn't this case be caught by the last line
in PageHuge() - "return dtor == free_huge_page;". This conditional
should return false for THP PageTail, right?
This patch adds comments:
/*
* hugetlbfs pages cannot be split from under us. If this is a
* hugetlbfs page, check refcount on head page and release the page if
* the refcount becomes zero.
*/
While this comment is correct for hugetlbfs, the harder problem is
when this is a THP tail page that can be splitted... and it looks like
that wasn't taken into consideration.
It looks stable on x86 but:
1) this is just a micro-optmization and as such it should be
microoptimzied, and it isn't.
It's repeating compound_head twice for hugetlbfs and it's
running compound_head+compound_trans_head for THP when a single
one is needed in both cases.
I see. PageHuge() changes the pointer it is passed to point to the head
page. Calling compound_head on the head page is superfluous even though
it doesn't do any harm and I can easily remove the call to
compound_head() in the patch, but was it really intentional in
PageHuge() to change the incoming pointer? That is a significant side
effect of a routine that on the surface looks like a simple check that
would return true or false.
Calling PageTail(page) after having called PageHuge(page) is pointless
since "page" was already changed to point to the head page. So is the
error in PageHuge(page) changing the pointer, or is the error in usage
of PageHuge()?
2) it doesn't remove compound_trans_head from the rest of the kernel.
With this patch you just made compound_trans_head meaningless. And
you added an implicit dependency on split_huge_page that cannot
ever touch page->first_page, see the "========= split_huge_page" above.
I think PageHuge() already takes this into account by checking the dtor
pointer. So I think this is not a problem. I do no intend to make
compound_trans_head meaningless but I see the problem with calling it
after having called PageHuge(), as I pointed out in my comments above.
3) if you go ahead with 2), we'll never be able to propagate
page->private from head to tail in split_huge_page, a thing
potentially required by tmpfs (Kirill?), or the kernel would crash
in put_page with this patch applied (dereferencing a dangling
pointer because tail_page->first_page is overwritten by
page_head->private).
4) it doesn't verify that the THP page size is never bigger than the
smallest hugetlbfs page size.
Such a condition would lead to silent memory corruption because it
assumes it deals with a valid hugetlbfs head without pinning it and
without rechecking pagetail, before taking the hugetlbfs-only path.
See for example PageSlab check, it does exactly this. PageSlab
min order is 1, so smaller than the THP page order.
None of the above seems to be fatal in practice on x86, but I doubt it
is worth it for -stable, and for upstream now you need to decide if we
drop all compound_trans_head.
I/O performance regression on -stable makes this problem worth looking
at in my opinion. Performance regression is very significant based upon
tests from database folks. There is definitely room for improvement in
the patch and your experience with this code is immensely helpful in
doing that. compound_trans_head shows up as one of the big hitters in
perf top when running database performance tests. My only intention in
using PageHuge() was to determine definitively if the page I am working
with is hugetlbfs page or not which PageHuge() seems to do except I
missed that it has a significant side effect. This can be addressed by
implementing PageHeadHuge() which is almost identical to PageHuge()
without the side effect, as you suggested.
If we don't drop compound_trans_head to allow propagating
page->private from head to tail in the future, it should be possible
to fix this patch to use a single compound_trans_head, so at the same
time we optimize away the ugly inefficiency I mentioned in 1).
Then if we want to remove the flakey assumption in 4) we should move
this together with the PageSlab check. And to move it along with
PageSlab I think you only need to implement a PageHeadHuge check (that
is required for 1) too).
If you insist to skip get_page_unless_zero (not just the
compound_lock) then I'd recommend at least to add a boot time check so
that if HPAGE_PMD_SHIFT > HPAGE_SHIFT THP disengage itself at boot and
cannot be ever enabled. So if some arch has multiple huge page sizes
and decides to make the THP page size not the smallest one available
in hardware, at least it will notice it's not a supported config in a
graceful way.
It is reasonable to add get_page_unless_zero() to hugetlbfs only path.
Thanks for reviewing this patch.
--
Khalid
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html