On 2/5/2024 5:15 PM, Michal Hocko wrote:
On Mon 05-02-24 10:50:32, Baolin Wang wrote:
On 2/2/2024 5:55 PM, Michal Hocko wrote:
On Fri 02-02-24 17:29:02, Baolin Wang wrote:
On 2/2/2024 4:17 PM, Michal Hocko wrote:
[...]
Agree. So how about below changing?
(1) disallow falling back to other nodes when handling in-use hugetlb, which
ensures consistent behavior in handling hugetlb.
I can see two cases here: alloc_contig_range, which is an internal kernel
user, and then we have memory offlining. The former shouldn't break the
per-node hugetlb pool reservations; the latter might not have any other
choice (the whole node could go offline, which resembles breaking cpu
affinity when the cpu is gone).
IMO, this is not always true for memory offlining: when handling a free
hugetlb page, it disallows falling back, which is inconsistent.
It's been some time since I last looked into that code, so I am not 100% sure
how the free pool is currently handled. The above is the way I _think_ it
should work from the usability POV.
Please see alloc_and_dissolve_hugetlb_folio().
This is the alloc_contig_range rather than the offlining path. Memory
offlining migrates in-use pages to a _different_ node (as long as one is
available) via do_migrate_range, and it dissolves free hugetlb pages via
dissolve_free_huge_pages. So the node's pool is altered, but as this is
an explicit offlining operation I think there is no choice to go
differently.
Not only memory offlining, but also longterm pinning (in
migrate_longterm_unpinnable_pages()) and memory failure (in
soft_offline_in_use_page()) can break the per-node hugetlb pool
reservations.
Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
users' expectations, but hugetlb migration already tries hard to allocate
a replacement hugetlb page, so the system must be under heavy memory
pressure if that fails, right? Is it possible that the hugetlb
reservation is simply overshot here? Or maybe the memory is just terribly
fragmented?
Could you be more specific about numbers in your failure case?
Sure. Our customer's machine contains several NUMA nodes, and the system
reserves a large amount of CMA memory, occupying 50% of the total memory,
which is used for virtual machines; it also reserves lots of hugetlb pages,
which can occupy 50% of the CMA. So before a virtual machine starts, hugetlb
can use 50% of the CMA, but when the virtual machine starts, the CMA is
claimed by the virtual machine and the hugetlb pages must be migrated out
of CMA.
Would it make more sense for hugetlb pages to _not_ use CMA in this
case? I mean, would we be better off overall if the hugetlb pool were
preallocated before the CMA is reserved? I do realize this is just
working around the current limitations, but it could be better than
nothing.
In this case, the CMA area is large and occupies 50% of the total memory.
The idea is that, if no virtual machines are launched, CMA memory can be
used by hugetlb as much as possible. Once virtual machines need to be
launched, it becomes necessary to allocate CMA memory as much as possible,
for example by migrating hugetlb pages out of CMA memory.
I am afraid your assumption doesn't correspond to the existing
implementation. hugetlb allocations are movable, but they are certainly
not as movable as regular pages. So you have to plan for a bigger
margin of spare memory to achieve more reliable movability.
Have you tried to handle this from userspace? It seems that you know
when there is CMA demand, so you could rebalance the hugetlb pools at
that moment, no?
Maybe this can help, but it only mitigates the issue ...
After more thinking, I still believe we should drop the __GFP_THISNODE
flag in alloc_and_dissolve_hugetlb_folio(). Firstly, it can not only cause
CMA allocation to fail, but might also cause memory offlining to fail, as
I said in the commit message. Secondly, there have been no user reports
complaining about breaking the per-node hugetlb pool, although longterm
pinning, memory failure, and memory offlining can all potentially break
it.
It is quite possible that traditional users (like large DBs) do not use
CMA heavily, so such a problem has not been observed so far. That doesn't
mean those problems do not really matter.
CMA is just one case; as I mentioned before, other situations can also
break the per-node hugetlb pool now.
Let's focus on the main point: why should we keep inconsistent behavior
when handling free and in-use hugetlb pages for alloc_contig_range()?
That's really confusing.