Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Mon, 20 Jul 2020 11:17:31 -0700

On 7/19/20 11:22 PM, Anshuman Khandual wrote:
> 
> 
> On 07/17/2020 10:32 PM, Mike Kravetz wrote:
>> On 7/16/20 10:02 PM, Anshuman Khandual wrote:
>>>
>>>
>>> On 07/16/2020 11:55 PM, Mike Kravetz wrote:
>>>> From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
>>>> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>>>> Date: Tue, 14 Jul 2020 15:54:46 -0700
>>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>>>>  hstate
>>>>
>>>> Instead of calling hugetlb_cma_reserve() directly from arch specific
>>>> code, call from hugetlb_add_hstate when adding a gigantic hstate.
>>>> hugetlb_add_hstate is either called from arch specific huge page setup,
>>>> or as the result of hugetlb command line processing.  In either case,
>>>> this is late enough in the init process that all numa memory information
>>>> should be initialized.  And, it is early enough to still use early
>>>> memory allocator.
>>>
>>> This assumes that hugetlb_add_hstate() is called from the arch code at
>>> the right point in time for the generic HugeTLB to do the required CMA
>>> reservation which is not ideal. I guess there must have been a reason why
>>> CMA reservation should always be called by the platform code which knows
>>> the boot sequence timing better.
>>
>> Actually, the code does not make the assumption that hugetlb_add_hstate
>> is called from arch specific huge page setup.  It can even be called later
>> at the time of hugetlb command line processing.
> 
> Yes, now that hugetlb_cma_reserve() has been moved into hugetlb_add_hstate().
> But then there is an explicit warning while trying to mix both the command
> line options i.e. hugepagesz= and hugetlb_cma=. The proposed code here has
> not changed that behavior and hence the following warning should have been
> triggered here as well.
> 
> 1) hugepagesz_setup()
> 	hugetlb_add_hstate()
> 		 hugetlb_cma_reserve()
> 
> 2) hugepages_setup()
> 	hugetlb_hstate_alloc_pages()	when order >= MAX_ORDER
> 
> 	if (hstate_is_gigantic(h)) {
> 		if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> 			pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
> 			break;
> 		}
> 		if (!alloc_bootmem_huge_page(h))
> 			break;
> 	}
> 
> Nonetheless, it does not make sense to mix both memblock and CMA based huge
> page pre-allocations. But looking at this again, could this warning ever
> have been triggered until now? Not unless a given platform calls
> hugetlb_cma_reserve() before __setup("hugepages=", hugepages_setup). Anyway,
> there seem to be good reasons to keep both memblock and CMA based
> pre-allocations in place. But mixing them together (as done in the proposed
> code here) does not seem to be right.

I'm not sure if I follow the question.

This proposal does not change the trigger for the warning printed when one
tries to both reserve CMA and pre-allocate gigantic pages.  If hugetlb_cma
is specified on the command line and someone also tries to pre-allocate
gigantic pages, they will get the warning.  Such a command line on x86 might
look like:

	hugetlb_cma=4G hugepagesz=1G hugepages=4

You will then see:
[    0.065864] HugeTLB: hugetlb_cma is enabled, skip boot time allocation
[    0.065866] HugeTLB: allocating 4 of page size 1.00 GiB failed.  Only allocated 0 hugepages.

Ideally we should eliminate the second message.  Note that this behavior
already exists in the current code; it is not introduced by this proposal.
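
To make the sequence concrete, here is a rough userspace model of the
gigantic page branch quoted earlier.  It is only a sketch, not kernel code:
struct boot_model and model_alloc_gigantic() are hypothetical names, the
cma_reserved flag stands in for hugetlb_cma[0] being set, and the page size
in the second message is hard-coded to 1.00 GiB for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for kernel state; none of these are kernel APIs. */
struct boot_model {
	bool cma_reserved;		/* models hugetlb_cma[0] being non-NULL */
	bool warned;			/* models pr_warn_once() having fired */
	unsigned long free_gigantic;	/* gigantic pages the early allocator can supply */
};

/*
 * Models the boot time allocation loop for a gigantic hstate.
 * Returns the number of pages actually pre-allocated.
 */
static unsigned long model_alloc_gigantic(struct boot_model *m,
					  unsigned long requested)
{
	unsigned long i;

	assert(m != NULL);

	for (i = 0; i < requested; i++) {
		if (m->cma_reserved) {
			/* CMA was reserved: skip boot time allocation entirely. */
			if (!m->warned) {
				printf("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
				m->warned = true;
			}
			break;
		}
		if (m->free_gigantic == 0)
			break;
		m->free_gigantic--;
	}

	/* The shortfall report fires even when the skip was deliberate. */
	if (i < requested)
		printf("HugeTLB: allocating %lu of page size 1.00 GiB failed.  Only allocated %lu hugepages.\n",
		       requested, i);

	return i;
}
```

With cma_reserved set and requested = 4, the model returns 0 and prints both
messages, which is why the second one looks redundant in that case.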

>> My 'reasoning' is that gigantic pages can currently be preallocated from
>> bootmem/memblock_alloc at the time of command line processing.  Therefore,
>> we should be able to reserve bootmem for CMA at the same time.  Is there
>> something wrong with this reasoning?  I tested this on x86 by removing the
>> call to hugetlb_add_hstate from arch specific code and instead forced the
>> call at command line processing time.  The ability to reserve CMA was the
>> same.
> 
> There is no problem with that reasoning. A __setup() triggered function
> should be able to perform CMA reservation. But as pointed out before, it does
> not make sense to mix both CMA reservation and memblock based pre-allocation.

Agreed.  I am not proposing that we do.  Sorry if you got that impression.

>> Yes, the CMA reservation interface says it should be called from arch
>> specific code.  However, if we currently depend on the ability to do
>> memblock_alloc at hugetlb command line processing time for gigantic page
>> preallocation, then I think we can do the CMA reservation here as well.
> 
> IIUC, CMA reservation and memblock alloc have some differences in terms of
> how the memory can be used later on, will have to dig deeper on this. But
> the comment section near cma_declare_contiguous_nid() is a concern.
> 
>  * This function reserves memory from early allocator. It should be
>  * called by arch specific code once the early allocator (memblock or bootmem)
>  * has been activated and all other subsystems have already allocated/reserved
>  * memory. This function allows to create custom reserved areas.
> 

Yes, that is the comment I was looking at as well.

However, note that hugetlb pre-allocation of gigantic pages will end up
calling memblock_alloc_range_nid.  This is the same routine used for CMA
reservations/allocations from cma_declare_contiguous_nid.  This is why
there should be no issue with doing CMA reservations at this time.

This may be the confusing part.  I am not saying we would do CMA reservations
and pre-allocations together.  Rather, they both rely on the same underlying
code, so either one can be called at the same point in the init process.
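
As an illustration of that argument only, here is a toy userspace model in
which both paths go through one shared early allocator.  Everything in it is
an assumption made for the sketch: struct early_alloc, early_alloc_range(),
model_prealloc_gigantic() and model_cma_reserve() are hypothetical names, and
early_alloc_range() merely stands in for the role of memblock_alloc_range_nid.
It does not reflect the real memblock or CMA implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EARLY_ALLOC_FAIL ((size_t)-1)

/* Hypothetical model of an early (memblock-like) allocator; not kernel code. */
struct early_alloc {
	size_t next;	/* next free offset in the modelled region */
	size_t limit;	/* size of the modelled region */
	bool active;	/* false once the page allocator has taken over */
};

/*
 * Shared back end used by both callers below.
 * Returns the start offset of the range, or EARLY_ALLOC_FAIL.
 */
static size_t early_alloc_range(struct early_alloc *ea, size_t size,
				size_t align)
{
	size_t start;

	assert(ea != NULL && align != 0);

	if (!ea->active)
		return EARLY_ALLOC_FAIL;	/* too late in the boot sequence */

	start = (ea->next + align - 1) / align * align;
	if (start > ea->limit || size > ea->limit - start)
		return EARLY_ALLOC_FAIL;	/* region exhausted */

	ea->next = start + size;
	return start;
}

/* Models gigantic page pre-allocation: one naturally aligned page. */
static size_t model_prealloc_gigantic(struct early_alloc *ea, size_t page_size)
{
	return early_alloc_range(ea, page_size, page_size);
}

/* Models a CMA reservation: an aligned range taken from the same allocator. */
static size_t model_cma_reserve(struct early_alloc *ea, size_t size,
				size_t align)
{
	return early_alloc_range(ea, size, align);
}
```

In this model, any point at which model_prealloc_gigantic() can succeed is
also a point at which model_cma_reserve() can succeed, and both fail once
active is cleared.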

>> Thinking about it some more, I suppose there could be some arch code that
>> could call hugetlb_add_hstate too early in the boot process.  But, I do
>> not think we have an issue with calling it too late.
>>
> 
> Calling it too late might mean the page allocator has already been fully
> initialized, and then CMA reservation would no longer be possible. Also,
> calling it too early could get in the way of other subsystems which might
> need memory reservations in specific physical ranges.

I thought about it some more and came up with a way to do all this at command
line processing time.  It will take me a day or two to put together.

The patch from Barry which started this thread is indeed needed and is in
Andrew's tree.  I'll start another thread with a patch to move CMA reservations
to command line processing.
-- 
Mike Kravetz