Re: [PATCH] mm, hugetlb: Avoid double clearing for hugetlb pages

"Guilherme G. Piccoli" <gpiccoli@xxxxxxxxxxxxx> · Tue, 20 Oct 2020 16:19:06 -0300

Hi Michal, thanks a lot for your thorough response. I'll address the
comments inline, below. Thanks also David and Mike - in fact, I almost
don't need to respond here after Mike; he went right to the point I'm
about to discuss, heh...

On 20/10/2020 05:20, Michal Hocko wrote:
> 
> Yes zeroing is quite costly and that is to be expected when the feature
> is enabled. Hugetlb, like other allocator users, performs its own
> initialization rather than going through the __GFP_ZERO path. More on
> that below.
> 
> Could you be more specific about why this is a problem? Hugetlb pool is
> usually preallocated once during early boot. 24s for 65GB of 2MB pages
> is a non-trivial amount of time but it doesn't look like a major disaster
> either. If the pool is allocated later it can take much more time due to
> memory fragmentation.
> 
> I definitely do not want to downplay this but I would like to hear about
> the real life examples of the problem.

Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was
just my simple test in a guest, the real case is much worse! It aligns
with Mike's comment: we have complaints of delays on the order of
minutes, due to a very big pool of hugepages being allocated.

Users have their own methodology for allocating pages; some prefer to
do that "later" for a variety of reasons, so early boot time allocations
are not always used and shouldn't be the only focus of the discussion
here.
In the specific report I had, the user complains about more than 3
minutes to allocate ~542G of 2M hugetlb pages.
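
To put that report in perspective, here is a quick back-of-the-envelope
sketch. It assumes "542G" means GiB, "2M" means MiB, and takes exactly
180s as a lower bound for "more than 3 minutes"; the helper names are
made up just for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)
#define GiB (1024ULL * MiB)

/* Number of huge pages in a pool (pool size assumed page-aligned). */
static uint64_t hugepage_count(uint64_t pool_bytes, uint64_t page_bytes)
{
	assert(page_bytes != 0);
	return pool_bytes / page_bytes;
}

/* Average allocation time per page, in microseconds. */
static double usec_per_page(uint64_t pages, double seconds)
{
	assert(pages != 0);
	return seconds * 1e6 / (double)pages;
}

/* Effective allocation throughput, in GiB per second. */
static double gib_per_sec(uint64_t pool_bytes, double seconds)
{
	assert(seconds > 0.0);
	return (double)pool_bytes / (double)GiB / seconds;
}
```

Under those assumptions, hugepage_count(542 * GiB, 2 * MiB) gives 277504
pages, which works out to roughly 650us per page, or about 3 GiB/s. That
is the same ballpark as the 64G in 24s from my guest test (about
2.7 GiB/s).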

Now, you'll ask why in the heck they are using init_on_alloc then -
right? So, the Kconfig option "CONFIG_INIT_ON_ALLOC_DEFAULT_ON" is set
by default in Ubuntu, for hardening reasons. So, the workaround for the
users complaining of delays in allocating hugetlb pages currently is to
set "init_on_alloc" to 0. It's a bit lame to ask users to disable such
a hardening feature just because we have a double initialization in
hugetlb...
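
Just to illustrate what I mean by double initialization, below is a toy
userspace model. It is NOT kernel code: every toy_* name is invented for
this sketch, and TOY_GFP_NO_AUTOINIT is only a stand-in for the kind of
flag being proposed. It counts how many times a page gets cleared
between allocation and first use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag: the caller promises to initialize the page itself. */
#define TOY_GFP_NO_AUTOINIT 0x1u

struct toy_stats {
	unsigned long clears;	/* how many times the page was zeroed */
};

static bool toy_init_on_alloc = true;	/* models init_on_alloc=1 */

/* Models the page allocator: zeroes on allocation unless told not to. */
static void *toy_alloc_page(size_t size, unsigned int flags,
			    struct toy_stats *st)
{
	void *page = malloc(size);

	if (page && toy_init_on_alloc && !(flags & TOY_GFP_NO_AUTOINIT)) {
		memset(page, 0, size);
		st->clears++;
	}
	return page;
}

/* Models the hugetlb fault path, which always clears the page itself. */
static void toy_fault_in_page(void *page, size_t size, struct toy_stats *st)
{
	assert(page != NULL);
	memset(page, 0, size);
	st->clears++;
}

/* Allocate and fault in one 2M page, returning the number of clears. */
static unsigned long toy_clears_per_page(unsigned int flags)
{
	struct toy_stats st = { 0 };
	size_t size = 2u * 1024u * 1024u;
	void *page = toy_alloc_page(size, flags, &st);

	if (!page)
		return 0;
	toy_fault_in_page(page, size, &st);
	free(page);
	return st.clears;
}
```

In this model, toy_clears_per_page(0) counts two clears and
toy_clears_per_page(TOY_GFP_NO_AUTOINIT) counts one; the clear that goes
away is the one paid at pool allocation time, which is where the delay
is being reported.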

> 
> 
> This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@xxxxxxxxxx).
> Previously it has been brought up in SLUB context AFAIR. Your numbers
> are quite clear here but do we really need a gfp flag with all the
> problems we tend to grow in with them?
> 
> One potential way around this specifically for hugetlb would be to use
> __GFP_ZERO when allocating from the allocator and marking the fact in
> the struct page while it is sitting in the pool. Page fault handler
> could then skip the zeroing phase. Not an act of beauty TBH but it
> fits into the existing model of the full control over initialization.
> Btw. it would allow implementing init_on_free semantics as well. I
> haven't implemented the actual two main methods
> hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because
> I am not entirely sure about the current state of hugetlb struct page in
> the pool. But there should be a lot of room in there (or in tail pages).
> Mike will certainly know much better. But the skeleton of the patch
> would look something like this (not even compile tested).
> [code...]

Thanks a lot for pointing me to the previous discussion! I should have
done my homework properly and read all versions of the patchset...my
bad! I'm glad to see this problem was discussed and considered early in
the patch submission; I guess it was only missing real-world numbers.

Your approach seems interesting, but as per Mike's response (which seems
to have anticipated all my arguments heheh) it is a bit reversed: it
solves a "non-existent" problem (zeroing hugetlb pages at fault time),
whereas the big problem this patch tries to fix is the massive delay at
allocation time of the hugetlb pages.

I understand that your suggestion avoids the burden of introducing more
GFP flags, and I agree that those are potentially dangerous if misused
(and I totally agree with David that __GFP_NOINIT_ON_ALLOC is heinous,
I'd rather go with the originally proposed __GFP_NO_AUTOINIT), but...
wouldn't that be letting the code drive a design decision? Like "oh,
adding a flag is so bad...better just let this bug/perf issue stay".

I agree with the arguments here, don't get me wrong - especially since
I'm far from being any kind of mm expert, I trust your judgement that
GFP flags are the utmost villains, but at the same time I'd rather not
change something (like the hugetlb zeroing code) that does not really
fix the issue discussed here. I'm open to other suggestions, of
course, but the GFP flag seems the least hacky way of fixing that, and
ultimately, the flags are meant for this, right? Controlling page
behavior stuff.

About misuse of a GFP flag, this is a risk for every "API" in the
kernel, and we rely on the (known to be great) kernel review process to
block that. We could even have a more "terrifying" comment around the
flag, asking new users to CC all relevant people in the patch
submission before using it...

Anyway, thanks a bunch for the good points raised here Michal, David and
Mike, and I appreciate your patience with somebody trying to mess with
your GFP flags. Let me know your thoughts!

Cheers,

Guilherme