On Tue, Mar 10, 2020 at 11:33:47AM -0700, Mike Kravetz wrote:
> On 3/10/20 11:05 AM, Roman Gushchin wrote:
> > On Tue, Mar 10, 2020 at 10:27:01AM -0700, Mike Kravetz wrote:
> >> On 3/9/20 5:25 PM, Roman Gushchin wrote:
> >>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> >>> index a74262c71484..ceeb06ddfd41 100644
> >>> --- a/arch/x86/kernel/setup.c
> >>> +++ b/arch/x86/kernel/setup.c
> >>> @@ -16,6 +16,7 @@
> >>>  #include <linux/pci.h>
> >>>  #include <linux/root_dev.h>
> >>>  #include <linux/sfi.h>
> >>> +#include <linux/hugetlb.h>
> >>>  #include <linux/tboot.h>
> >>>  #include <linux/usb/xhci-dbgp.h>
> >>>
> >>> @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
> >>>  	initmem_init();
> >>>  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
> >>>
> >>> +	hugetlb_cma_reserve();
> >>> +
> >>
> >> I know this is called from arch specific code here to fit in with the timing
> >> of CMA setup/reservation calls. However, there really is nothing architecture
> >> specific about this functionality. It would be great IMO if we could make
> >> this architecture independent. However, I cannot think of a straightforward
> >> way to do this.
> >
> > I agree. Unfortunately I have no better idea than having an arch-dependent hook.
> >
> >>
> >>>  	/*
> >>>  	 * Reserve memory for crash kernel after SRAT is parsed so that it
> >>>  	 * won't consume hotpluggable memory.
> >> <snip>
> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> <snip>
> >>> +void __init hugetlb_cma_reserve(void)
> >>> +{
> >>> +	unsigned long totalpages = 0;
> >>> +	unsigned long start_pfn, end_pfn;
> >>> +	phys_addr_t size;
> >>> +	int nid, i, res;
> >>> +
> >>> +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
> >>> +		return;
> >>> +
> >>> +	if (hugetlb_cma_percent) {
> >>> +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
> >>> +				       NULL)
> >>> +			totalpages += end_pfn - start_pfn;
> >>> +
> >>> +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
> >>> +			10000UL;
> >>> +	} else {
> >>> +		size = hugetlb_cma_size;
> >>> +	}
> >>> +
> >>> +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
> >>> +		size / nr_online_nodes);
> >>> +
> >>> +	size /= nr_online_nodes;
> >>> +
> >>> +	for_each_node_state(nid, N_ONLINE) {
> >>> +		unsigned long min_pfn = 0, max_pfn = 0;
> >>> +
> >>> +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> >>> +			if (!min_pfn)
> >>> +				min_pfn = start_pfn;
> >>> +			max_pfn = end_pfn;
> >>> +		}
> >>> +
> >>> +		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
> >>> +					     PFN_PHYS(max_pfn), (1UL << 30),
> >>
> >> The alignment is hard-coded for the x86 gigantic page size. If this supports
> >> more architectures or becomes arch independent we will need to determine
> >> what this alignment should be. Perhaps an arch specific callback to get
> >> the alignment for gigantic pages. That will require a little thought as
> >> some architectures support multiple gigantic page sizes.
> >
> > Good point!
> > Should we take the biggest possible size as a reference?
> > Or the smallest (larger than MAX_ORDER)?
>
> As mentioned, it is pretty simple for architectures like x86 that only
> have one gigantic page size. Just a random thought, but since
> hugetlb_cma_reserve is called from arch specific code perhaps the arch
> code could pass in at least the alignment as an argument to this function?
> That way we can somewhat push the issue to the architectures. For example,
> power supports a 16GB gigantic page size but I believe those are allocated
> via firmware somehow. So, they would not pass 16G as alignment. In this
> case setup of the CMA area is somewhat architecture dependent. So, perhaps
> the call from arch specific code is not such a bad idea.
>
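
For illustration, passing the gigantic page order in from the arch code could
look roughly like the sketch below. This is only a sketch of the interface
being discussed; the order parameter and the PUD_SHIFT-based call site are
assumptions, not something taken from the posted patch:

/* mm/hugetlb.c: take the gigantic page order from the caller instead of
 * hard-coding the 1GB (1UL << 30) alignment.
 */
void __init hugetlb_cma_reserve(int order)
{
	/* alignment follows from the order the architecture passed in */
	phys_addr_t align = PAGE_SIZE << order;
	...
	res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
				     PFN_PHYS(max_pfn), align,
				     ...);
	...
}

/* arch/x86/kernel/setup.c: x86 has a single gigantic page size, 1GB (PUD) */
hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
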
> With that in mind, we may want some type of coordination between arch
> specific and independent code. Right now, cmdline_parse_hugetlb_cma
> will accept a hugetlb_cma command line option without complaint even
> if the architecture does not call hugetlb_cma_reserve.
>
> Just a nit, but cma_declare_contiguous is going to round up size by
> alignment. So, the actual reserved size may not match what is printed with:
> +		pr_info("hugetlb_cma: successfully reserved %llu on node %d\n",
> +			size, nid);
>
> I found this out by testing the code and specifying hugetlb_cma=2M. Messages
> in the log were:
> kernel: hugetlb_cma: reserve 2097152, 1048576 per node
> kernel: hugetlb_cma: successfully reserved 1048576 on node 0
> kernel: hugetlb_cma: successfully reserved 1048576 on node 1
> But, it really reserved 1GB per node.

Good point! If the passed size is too small to cover a single huge page, we
should probably print a warning and bail out. Will fix in the next version.

Thanks!
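
For reference, a minimal sketch of such a check, placed after the per-node
size is computed in hugetlb_cma_reserve(); the 1GB threshold mirrors the
alignment currently hard-coded in the patch, and the message wording is
purely illustrative:

	size /= nr_online_nodes;

	/* If the per-node size cannot cover even one gigantic page,
	 * cma_declare_contiguous() would silently round it up, so warn
	 * and bail out instead of reserving more than was requested.
	 */
	if (size < (1UL << 30)) {
		pr_warn("hugetlb_cma: %llu per node is smaller than a gigantic page, skipping reservation\n",
			size);
		return;
	}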