Hi all,

sorry for jumping into this discussion, but this is exactly the feature I am
currently looking for on ARM64. You may browse the linux-arm-msm@xxxxxxxxxxxxxxx
archive for a message I posted there recently.

I read somewhere in this thread that it should be easily portable to
architectures other than x86. I am ready to give it a try, but I would need a
few pointers on where to hook in within the ARM64 code.

Thanks and best regards,
Andreas

On 10.03.2020 19:33, Mike Kravetz wrote:
> On 3/10/20 11:05 AM, Roman Gushchin wrote:
>> On Tue, Mar 10, 2020 at 10:27:01AM -0700, Mike Kravetz wrote:
>>> On 3/9/20 5:25 PM, Roman Gushchin wrote:
>>>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>>>> index a74262c71484..ceeb06ddfd41 100644
>>>> --- a/arch/x86/kernel/setup.c
>>>> +++ b/arch/x86/kernel/setup.c
>>>> @@ -16,6 +16,7 @@
>>>>  #include <linux/pci.h>
>>>>  #include <linux/root_dev.h>
>>>>  #include <linux/sfi.h>
>>>> +#include <linux/hugetlb.h>
>>>>  #include <linux/tboot.h>
>>>>  #include <linux/usb/xhci-dbgp.h>
>>>>
>>>> @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
>>>>  	initmem_init();
>>>>  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
>>>>
>>>> +	hugetlb_cma_reserve();
>>>> +
>>>
>>> I know this is called from arch specific code here to fit in with the timing
>>> of CMA setup/reservation calls.  However, there really is nothing architecture
>>> specific about this functionality.  It would be great IMO if we could make
>>> this architecture independent.  However, I can not think of a straightforward
>>> way to do this.
>>
>> I agree. Unfortunately I have no better idea than having an arch-dependent hook.
>>
>>>
>>>>  	/*
>>>>  	 * Reserve memory for crash kernel after SRAT is parsed so that it
>>>>  	 * won't consume hotpluggable memory.
>>> <snip>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> <snip>
>>>> +void __init hugetlb_cma_reserve(void)
>>>> +{
>>>> +	unsigned long totalpages = 0;
>>>> +	unsigned long start_pfn, end_pfn;
>>>> +	phys_addr_t size;
>>>> +	int nid, i, res;
>>>> +
>>>> +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
>>>> +		return;
>>>> +
>>>> +	if (hugetlb_cma_percent) {
>>>> +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
>>>> +				       NULL)
>>>> +			totalpages += end_pfn - start_pfn;
>>>> +
>>>> +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
>>>> +			10000UL;
>>>> +	} else {
>>>> +		size = hugetlb_cma_size;
>>>> +	}
>>>> +
>>>> +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
>>>> +		size / nr_online_nodes);
>>>> +
>>>> +	size /= nr_online_nodes;
>>>> +
>>>> +	for_each_node_state(nid, N_ONLINE) {
>>>> +		unsigned long min_pfn = 0, max_pfn = 0;
>>>> +
>>>> +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>>>> +			if (!min_pfn)
>>>> +				min_pfn = start_pfn;
>>>> +			max_pfn = end_pfn;
>>>> +		}
>>>> +
>>>> +		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
>>>> +					     PFN_PHYS(max_pfn), (1UL << 30),
>>>
>>> The alignment is hard coded for x86 gigantic page size.  If this supports
>>> more architectures or becomes arch independent we will need to determine
>>> what this alignment should be.  Perhaps an arch specific callback to get
>>> the alignment for gigantic pages.  That will require a little thought as
>>> some arch's support multiple gigantic page sizes.
>>
>> Good point!
>> Should we take the biggest possible size as a reference?
>> Or the smallest (larger than MAX_ORDER)?
>
> As mentioned, it is pretty simple for architectures like x86 that only
> have one gigantic page size.
> Just a random thought, but since hugetlb_cma_reserve is called from arch
> specific code, perhaps the arch code could pass in at least the alignment as
> an argument to this function?  That way we can somewhat push the issue to the
> architectures.  For example, power supports a 16GB gigantic page size, but I
> believe those are allocated via firmware somehow.  So, they would not pass
> 16G as alignment.  In this case setup of the CMA area is somewhat
> architecture dependent.  So, perhaps the call from arch specific code is not
> such a bad idea.
>
> With that in mind, we may want some type of coordination between arch
> specific and independent code.  Right now, cmdline_parse_hugetlb_cma
> will accept a hugetlb_cma command line option without complaint even
> if the architecture does not call hugetlb_cma_reserve.
>
> Just a nit, but cma_declare_contiguous is going to round up size by
> alignment.  So, the actual reserved size may not match what is printed with:
>
> +		pr_info("hugetlb_cma: successfully reserved %llu on node %d\n",
> +			size, nid);
>
> I found this out by testing the code and specifying hugetlb_cma=2M.  Messages
> in the log were:
>   kernel: hugetlb_cma: reserve 2097152, 1048576 per node
>   kernel: hugetlb_cma: successfully reserved 1048576 on node 0
>   kernel: hugetlb_cma: successfully reserved 1048576 on node 1
> But, it really reserved 1GB per node.
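
To make my ARM64 question a bit more concrete, below is a rough, untested
sketch of how the hook could look if hugetlb_cma_reserve() took the gigantic
page order from the architecture, as Mike suggests above. The call site in
arch/arm64/mm/init.c (bootmem_init(), after arm64_numa_init()) and the
PUD_SHIFT - PAGE_SHIFT order are only my assumptions, not part of the posted
patch:

/*
 * Sketch only -- not the posted patch.  Assumes hugetlb_cma_reserve()
 * grows an "order" argument so each architecture can pass the order of
 * its gigantic page, and the CMA alignment is derived from it instead
 * of the hard coded (1UL << 30).
 */

/* include/linux/hugetlb.h */
void __init hugetlb_cma_reserve(int order);

/* arch/arm64/mm/init.c -- guessed call site */
void __init bootmem_init(void)
{
	...
	arm64_numa_init();

	/*
	 * Call after NUMA init so node_online_map is populated before the
	 * per-node CMA areas are declared.  With 4K pages a gigantic page
	 * is PUD sized (1GB), so the order is PUD_SHIFT - PAGE_SHIFT.
	 */
	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
	...
}

/* mm/hugetlb.c -- alignment now comes from the arch-provided order */
		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
					     PFN_PHYS(max_pfn),
					     PAGE_SIZE << order, ...);

On x86 the equivalent call would pass the same 1GB order, so the hard coded
alignment could go away there as well. Does something along these lines look
reasonable, or am I missing an arm64-specific ordering constraint?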