On 10/13/21 7:23 PM, Baolin Wang wrote:
>
>
> On 2021/10/14 6:06, Mike Kravetz wrote:
>> On 10/9/21 10:24 PM, Baolin Wang wrote:
>>> Now the size of the CMA area for runtime allocation of gigantic
>>> hugepages is balanced across all online nodes, but we also want to
>>> specify the size of CMA per node, or on only one node in some cases,
>>> similar to commit 86acc55c3d32 ("hugetlbfs: extend the definition of
>>> hugepages parameter to support node allocation")[1].
>>>
>>> Thus this patch adds a node format for the 'hugetlb_cma' parameter
>>> to support specifying the size of CMA per node. An example is as
>>> follows:
>>>
>>> hugetlb_cma=0:5G,2:5G
>>>
>>> which means allocating a 5G CMA area on node 0 and node 2,
>>> respectively.
>>>
>>> [1] https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@xxxxxxxxx
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
>>> ---
>>>  Documentation/admin-guide/kernel-parameters.txt |  6 +-
>>>  mm/hugetlb.c                                    | 79 +++++++++++++++++++++----
>>>  2 files changed, 73 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>> index 3ad8e9d0..a147faa5 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -1587,8 +1587,10 @@
>>>  			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
>>>  	hugetlb_cma=	[HW,CMA] The size of a CMA area used for allocation
>>> -			of gigantic hugepages.
>>> -			Format: nn[KMGTPE]
>>> +			of gigantic hugepages. Or using node format, the size
>>> +			of a CMA area per node can be specified.
>>> +			Format: nn[KMGTPE] or (node format)
>>> +				<node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
>>>  			Reserve a CMA area of given size and allocate gigantic
>>>  			hugepages using the CMA allocator. If enabled, the
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 6d2f4c2..8b4e409 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -50,6 +50,7 @@
>>>  #ifdef CONFIG_CMA
>>>  static struct cma *hugetlb_cma[MAX_NUMNODES];
>>> +static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
>>>  static bool hugetlb_cma_page(struct page *page, unsigned int order)
>>>  {
>>>  	return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
>>> @@ -62,6 +63,7 @@ static bool hugetlb_cma_page(struct page *page, unsigned int order)
>>>  }
>>>  #endif
>>>  static unsigned long hugetlb_cma_size __initdata;
>>> +static nodemask_t hugetlb_cma_nodes_allowed = NODE_MASK_NONE;
>>>  /*
>>>   * Minimum page order among possible hugepage sizes, set to a proper value
>>> @@ -3497,9 +3499,15 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
>>>  	if (nid == NUMA_NO_NODE) {
>>>  		/*
>>> +		 * If we've specified the size of CMA area per node,
>>> +		 * should use it firstly.
>>> +		 */
>>> +		if (hstate_is_gigantic(h) && !nodes_empty(hugetlb_cma_nodes_allowed))
>>> +			n_mask = &hugetlb_cma_nodes_allowed;
>>> +		/*
>>
>> IIUC, this changes the behavior for 'balanced' gigantic huge page pool
>> allocations if per-node hugetlb_cma is specified.  It will now only
>> attempt to allocate gigantic pages on nodes where CMA was reserved.
>> Even if we run out of space on the node, it will not go to other nodes
>> as before.  Is that correct?
>
> Right.
>
>>
>> I do not believe we want this change in behavior.  IMO, if the user is
>> doing node specific CMA reservations, then the user should use the node
>> specific sysfs file for pool allocations on that node.
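For reference, the node specific sysfs file mentioned above is the
standard per-node knob under /sys/devices/system/node/. A minimal
userspace sketch of growing the gigantic page pool on one node through
it (assuming the documented sysfs layout and 1G huge pages; error
handling trimmed):

  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
      /* per-node pool size file for 1G pages on node 0 */
      int fd = open("/sys/devices/system/node/node0/hugepages/"
                    "hugepages-1048576kB/nr_hugepages", O_WRONLY);

      if (fd < 0)
          return 1;
      write(fd, "2", 1);  /* request 2 gigantic pages on node 0 only */
      close(fd);
      return 0;
  }

Writing the per-node file attempts allocation only on that node, so a
user who reserved CMA on specific nodes can grow the pool there without
changing the balanced behavior of the global nr_hugepages knob.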
>
> Sounds more reasonable, will move 'hugetlb_cma_nodes_allowed' to the
> node specific allocation.
>
>>>  		 * global hstate attribute
>>>  		 */
>>> -		if (!(obey_mempolicy &&
>>> +		else if (!(obey_mempolicy &&
>>>  				init_nodemask_of_mempolicy(&nodes_allowed)))
>>>  			n_mask = &node_states[N_MEMORY];
>>>  		else
>>> @@ -6745,7 +6753,38 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>>>  static int __init cmdline_parse_hugetlb_cma(char *p)
>>>  {
>>> -	hugetlb_cma_size = memparse(p, &p);
>>> +	int nid, count = 0;
>>> +	unsigned long tmp;
>>> +	char *s = p;
>>> +
>>> +	while (*s) {
>>> +		if (sscanf(s, "%lu%n", &tmp, &count) != 1)
>>> +			break;
>>> +
>>> +		if (s[count] == ':') {
>>> +			nid = tmp;
>>> +			if (nid < 0 || nid >= MAX_NUMNODES)
>>> +				break;
>>
>> nid can only be compared to MAX_NUMNODES because this is an early param,
>> parsed before NUMA is set up, and we do not know exactly how many nodes
>> there are.  Is this correct?
>
> Yes.
>
>>
>> Suppose one specifies an invalid node.  For example, on my 2 node system
>> I use the option 'hugetlb_cma=2:2G'.  This is not flagged as an error
>> during processing and 1G CMA is reserved on node 0 and 1G is reserved
>> on node 1.  Is that by design, or just chance?
>
> Actually we won't allocate any CMA area in this case, since in
> hugetlb_cma_reserve() we only iterate over the online nodes when trying
> to allocate the CMA area, and node 2 is not in the range of online
> nodes in this case.
>

But since it cannot do node specific allocations, it falls through to
the all-nodes case?  Here is what I see:

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 8053 MB
node 0 free: 6543 MB
node 1 cpus: 2 3
node 1 size: 8150 MB
node 1 free: 4851 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
# reboot
# dmesg | grep -i huge
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-rc4-mm1+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugetlb_cma=2:2G
[    0.008345] hugetlb_cma: reserve 2048 MiB, up to 1024 MiB per node
[    0.008349] hugetlb_cma: reserved 1024 MiB on node 0
[    0.008352] hugetlb_cma: reserved 1024 MiB on node 1
[    0.053682] Kernel command line: BOOT_IMAGE=/vmlinuz-5.15.0-rc4-mm1+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugetlb_cma=2:2G
[    0.401648] HugeTLB: can free 4094 vmemmap pages for hugepages-1048576kB
[    0.413681] HugeTLB: can free 6 vmemmap pages for hugepages-2048kB
[    0.414590] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.415653] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages

-- 
Mike Kravetz
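For readers following the parsing discussion, here is a small
standalone userspace sketch of the patch's tokenizing loop, handy for
checking how strings such as '0:5G,2:5G' or the questionable '2:2G'
are consumed. memparse_stub() below is a hypothetical stand-in for the
kernel's memparse(), and MAX_NUMNODES is pinned to 64 purely for
illustration:

  #include <stdio.h>
  #include <stdlib.h>

  #define MAX_NUMNODES 64   /* build-time bound, standing in for the kernel's */

  /* hypothetical stand-in for the kernel's memparse(): number + K/M/G suffix */
  static unsigned long memparse_stub(const char *s, char **end)
  {
      unsigned long val = strtoul(s, end, 0);

      switch (**end) {
      case 'G': case 'g': val <<= 30; (*end)++; break;
      case 'M': case 'm': val <<= 20; (*end)++; break;
      case 'K': case 'k': val <<= 10; (*end)++; break;
      }
      return val;
  }

  int main(void)
  {
      char buf[] = "0:5G,2:5G";   /* try "2:2G" or a plain "6G" as well */
      char *s = buf;
      unsigned long tmp;
      int nid, count = 0;

      while (*s) {
          if (sscanf(s, "%lu%n", &tmp, &count) != 1)
              break;

          if (s[count] == ':') {
              nid = tmp;                /* node id before the ':' */
              if (nid < 0 || nid >= MAX_NUMNODES)
                  break;                /* only a build-time bound check */
              s += count + 1;
              tmp = memparse_stub(s, &s);    /* per-node size after the ':' */
              printf("node %d: %lu bytes\n", nid, tmp);
              if (*s == ',')
                  s++;
          } else {
              /* no ':' at all: the whole string is one global size */
              printf("global: %lu bytes\n", memparse_stub(buf, &s));
              break;
          }
      }
      return 0;
  }

As the thread notes, a nid of 2 passes the MAX_NUMNODES bound at parse
time; whether that node is actually online is only known later, when
hugetlb_cma_reserve() walks the online nodes.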