Re: [PATCH v2] mm: fix aio performance regression for database caused by THP (backport to stable)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Oct 21, 2013 at 11:01:07AM -0600, Khalid Aziz wrote:
> Hi Andrea,
> 
> Sure, this can happen but wouldn't this case be caught by the last line 
> in PageHuge() - "return dtor == free_huge_page;". This conditional 
> should return false for THP PageTail, right?

The problem is that get_compound_page_dtor(page) runs on a "page"
pointer that could be a dangling link if we ever update the
tail_page->private field during
split_huge_page. get_compound_page_dtor has to dereference "page"
somehow to get to the "dtor" value.

> I see. PageHuge() changes the pointer it is passed to point to the head 
> page. Calling compound_head on the head page is superfluous even though 
> it doesn't do any harm and I can easily remove the call to 
> compound_head() in the patch, but was it really intentional in 
> PageHuge() to change the incoming pointer? That is a significant side 
> effect of a routine that on the surface looks like a simple check that 
> would return true or false.

PageHuge in current implementation can be passed a tail or head
hugetlbfs page and it looks all right for that kind of hugetlbfs
usage. It wasn't supposed to be called by the VM core that has to deal
with all kind of compound pages (not just hugetlbfs).

PageHuge cannot be passed a THP page unless we decide to depend on
first_page not to ever change across a split_huge_page. But if we
decide to depend on that (like current upstream code after the
hugetlbfs directio optimization was applied), then we should also
convert all compound_trans_head to compound_head or we leave dirt into
the source. Plus that would optimize away all other smp_rmb too.

In older kernels page->first_page was in the same union with
page->mapping and IMHO it looked too flakey to depend on alignments or
variable ordering of fields that tends to change all the time within
that very union. Now page->mapping is out of that union, so it looks
much safer safer to drop compound_trans_head.

Now that page->mapping is out of that union, the only concerns that
remains if that if we convert all compound_trans_head to
compound_head, then we'd never be allowed to ever enforce the
invariant that page->private is zero whenever PG_private is clear and
the page is not the tail of a compound page (all locations except
split_huge_page are enforcing that, but anon pages cannot care less
about page->private being zero and page->private is clobbered and then
initialized to zero during allocation by the buddy allocator when the
page is reallocated so it's fine in practice, but maybe somebody could
already consider it a bug even that we leave dirt in page->private
after split_huge_page).

So as THP expands into tmpfs and pagecache, it may be more relevant to
be consistent and be able to update page->private of the tail pages
during split_huge_page. Either to clear or maybe to propagate it from
head to tail pages.

It seems clearing page->private whenever PG_private is not set, is
purely a debug tweak and so we're effectively wasting CPU not just in
the smp_rmb of compound_trans_head, but also in the
set_page_private(page, 0) of all over the place whenever a PG_private
is cleared and starting from prep_new_page.

There are two ways to go for the long term:

1) drop all set_page_private everywhere and convert all
compound_trans_head to compound_head (and cross fingers that we'd
never need to propagate the head page->private to the tail pages
during split_huge_page and hope some random update to struct page
doesn't introduce silent corruption hardly reproducible)

2) keep using compound_trans_head and add set_page_private(tail_page,
0) to split_huge_page to enforce the invariant that all regular anon
pages have page->private zero (debug tweak already enforced everywhere
else)

I'm fine either ways, I'm just saying current source is not ok as
compound_trans_head isn't making sense anymore.

> Calling PageTail(page) after having called PageHuge(page) is pointless 
> since "page" was already changed to point to the head page. So is the 
> error in PageHuge(page) changing the pointer, or is the error in usage 
> of PageHuge()?

PageHuge is fine as is I think (I supposed it wants to succeed on tail
pages too).

In the short term I would just run compound_trans_head once like
previous code did. Then introduce a PageHeadHuge that skips the
compound_head and PageCompound check to optimize your current
patch. Then later make another patch that converts all
compound_trans_head to compound_head, or that adds
set_page_private(tail_page, 0) to split_huge_page to enforce the
PG_private vs page->private debug tweak.

> >
> > 2) it doesn't remove compound_trans_head from the rest of the kernel.
> >
> >     With this patch you just made compound_trans_head meaningless.  And
> >     you added an implicit dependency on split_huge_page that cannot
> >     ever touch page->first_page, see the "========= split_huge_page" above.
> 
> I think PageHuge() already takes this into account by checking the dtor 
> pointer. So I think this is not a problem. I do no intend to make 
> compound_trans_head meaningless but I see the problem with calling it 
> after having called PageHuge(), as I pointed out in my comments above.

I see you didn't intend to alter the THP path, it was just a side effect.

But even if we convert all compound_trans_head to compound_head we're
still left with a superflous compound_head in all cases
(hugetlbfs/slab/THP etc..) in those paths, so we can surely optimize
this further (considering this is a microoptimization it's worth
optimizing it fully).

> I/O performance regression on -stable makes this problem worth looking 
> at in my opinion. Performance regression is very significant based upon 
> tests from database folks. There is definitely room for improvement in 
> the patch and your experience with this code is immensely helpful in 
> doing that. compound_trans_head shows up as one of the big hitters in 
> perf top when running database performance tests. My only intention in 
> using PageHuge() was to determine definitively if the page I am working 
> with is hugetlbfs page or not which PageHuge() seems to do except I 
> missed that it has a significant side effect. This can be addressed by 
> implementing PageHeadHuge() which is almost identical to PageHuge() 
> without the side effect, as you suggested.

yes.

> It is reasonable to add get_page_unless_zero() to hugetlbfs only path.

We should use the benchmark to find if it's a problem to run
get_page_unless_zero there (by plugging the newly introduced
PageHeadHuge together with the PageSlab check). And then we should
also verify if the compound_trans_head if measurable by implementing
it identical to compound_head, so we know if it's worth it or not to
keep it as now it's mostly needed to cope with page->private.

Thanks!
Andrea
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]