On Fri 02-02-24 17:29:02, Baolin Wang wrote:
> On 2/2/2024 4:17 PM, Michal Hocko wrote:
[...]
> > > Agree. So how about the below change?
> > > (1) disallow falling back to other nodes when handling in-use hugetlb,
> > > which can ensure consistent behavior in handling hugetlb.
> >
> > I can see two cases here. alloc_contig_range which is an internal kernel
> > user and then we have memory offlining. The former shouldn't break the
> > per-node hugetlb pool reservations, the latter might not have any other
> > choice (the whole node could go offline and that resembles breaking cpu
> > affinity if the cpu is gone).
>
> IMO, that's not always true for memory offlining: when handling a free
> hugetlb, it disallows falling back, which is inconsistent.

It's been some time since I've looked into that code so I am not 100% sure
how the free pool is currently handled. The above is the way I _think_ it
should work from the usability POV.

> Not only memory offlining, but also the longterm pinning (in
> migrate_longterm_unpinnable_pages()) and memory failure (in
> soft_offline_in_use_page()) can also break the per-node hugetlb pool
> reservations.

Bad

> > Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
> > users' expectations, but hugetlb migration already tries hard to allocate
> > a replacement hugetlb so the system must be under heavy memory pressure
> > if that fails, right? Is it possible that the hugetlb reservation is just
> > overshot here? Maybe the memory is just terribly fragmented though?
> >
> > Could you be more specific about the numbers in your failure case?
>
> Sure. Our customer's machine contains several NUMA nodes, and the system
> reserves a large amount of CMA memory, occupying 50% of the total memory,
> which is used for the virtual machine. It also reserves lots of hugetlb,
> which can occupy 50% of the CMA. So before starting the virtual machine,
> the hugetlb can use 50% of the CMA, but when the virtual machine is
> started, the CMA will be used by the virtual machine and the hugetlb has
> to be migrated out of the CMA.

Would it make more sense for hugetlb pages to _not_ use CMA in this case?
I mean, would we be better off overall if the hugetlb pool was preallocated
before the CMA is reserved? I do realize this is just working around the
current limitations but it could be better than nothing.

> Because there are several nodes in the system, one node's memory can be
> exhausted, which will fail the hugetlb migration with the __GFP_THISNODE
> flag.

Is the workload NUMA-aware? I.e. do you bind virtual machines to specific
nodes?
-- 
Michal Hocko
SUSE Labs
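
[Editor's illustrative aside, not part of the thread: the question above is
whether the VM workload is bound to specific nodes. A minimal userspace
sketch of one way such a binding is commonly done with libnuma is below; the
node number, file name, and exec-wrapper approach are assumptions made for
illustration only, not something described by either poster.]

/*
 * Hypothetical wrapper: bind the current process to a single NUMA node
 * before exec'ing the VM, so that both its CPU placement and its memory
 * allocations stay node-local. Node 0 is an arbitrary assumption.
 * Build with: gcc bind_node.c -o bind_node -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int node = 0;	/* assumed target node */

	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}
	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available on this system\n");
		return 1;
	}

	/* Run only on the CPUs of the chosen node. */
	if (numa_run_on_node(node) < 0) {
		perror("numa_run_on_node");
		return 1;
	}

	/* Restrict all future memory allocations to that node. */
	struct bitmask *nodes = numa_allocate_nodemask();
	numa_bitmask_setbit(nodes, (unsigned int)node);
	numa_set_membind(nodes);
	numa_bitmask_free(nodes);

	/* Hand over to the actual workload, e.g. the VM launcher. */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}

[This is roughly what "numactl --cpunodebind=<n> --membind=<n> <cmd>" does;
whether the customer setup in the thread actually uses such a binding is
exactly what the question asks.]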