On Fri 02-02-24 17:29:02, Baolin Wang wrote:
> On 2/2/2024 4:17 PM, Michal Hocko wrote:
[...]
> > > Agree. So how about the below change?
> > > (1) disallow falling back to other nodes when handling in-use hugetlb,
> > > which can ensure consistent behavior in handling hugetlb.
> >
> > I can see two cases here. alloc_contig_range which is an internal kernel
> > user and then we have memory offlining. The former shouldn't break the
> > per-node hugetlb pool reservations, the latter might not have any other
> > choice (the whole node could go offline and that resembles breaking cpu
> > affinity if the cpu is gone).
>
> IMO, that's not always true for memory offlining: when handling a free
> hugetlb, it disallows falling back, which is inconsistent.

It's been some time since I've looked into that code so I am not 100% sure
how the free pool is currently handled. The above is the way I _think_ it
should work from the usability POV.

> Not only memory offlining, but also the longterm pinning (in
> migrate_longterm_unpinnable_pages()) and memory failure (in
> soft_offline_in_use_page()) can also break the per-node hugetlb pool
> reservations.

Bad

> > Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
> > users' expectations, but hugetlb migration already tries hard to allocate
> > a replacement hugetlb so the system must be under heavy memory pressure
> > if that fails, right? Is it possible that the hugetlb reservation is just
> > overshot here? Maybe the memory is just terribly fragmented though?
> >
> > Could you be more specific about the numbers in your failure case?
>
> Sure. Our customer's machine contains several NUMA nodes, and the system
> reserves a large amount of CMA memory, occupying 50% of the total memory,
> which is used for the virtual machine. It also reserves lots of hugetlb,
> which can occupy 50% of the CMA. So before starting the virtual machine,
> the hugetlb can use 50% of the CMA, but when the virtual machine is
> started, the CMA will be used by the virtual machine and the hugetlb has
> to be migrated out of the CMA.

Would it make more sense for hugetlb pages to _not_ use CMA in this case?
I mean, would we be better off overall if the hugetlb pool was preallocated
before the CMA is reserved? I do realize this is just working around the
current limitations but it could be better than nothing.

> Because there are several nodes in the system, one node's memory can be
> exhausted, which will fail the hugetlb migration with the __GFP_THISNODE
> flag.

Is the workload NUMA-aware? I.e. do you bind virtual machines to specific
nodes?
-- 
Michal Hocko
SUSE Labs
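
[Editor's illustrative aside, not part of the thread: the question above is
whether the VM workload is bound to specific nodes. A minimal userspace
sketch of one way such a binding is commonly done with libnuma is below; the
node number, file name, and exec-wrapper approach are assumptions made for
illustration only, not something described by either poster.]

/*
 * Hypothetical wrapper: bind the current process to a single NUMA node
 * before exec'ing the VM, so that both its CPU placement and its memory
 * allocations stay node-local. Node 0 is an arbitrary assumption.
 * Build with: gcc bind_node.c -o bind_node -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int node = 0;	/* assumed target node */

	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}
	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available on this system\n");
		return 1;
	}

	/* Run only on the CPUs of the chosen node. */
	if (numa_run_on_node(node) < 0) {
		perror("numa_run_on_node");
		return 1;
	}

	/* Restrict all future memory allocations to that node. */
	struct bitmask *nodes = numa_allocate_nodemask();
	numa_bitmask_setbit(nodes, (unsigned int)node);
	numa_set_membind(nodes);
	numa_bitmask_free(nodes);

	/* Hand over to the actual workload, e.g. the VM launcher. */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}

[This is roughly what "numactl --cpunodebind=<n> --membind=<n> <cmd>" does;
whether the customer setup in the thread actually uses such a binding is
exactly what the question asks.]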