Hi Mike: Thanks for your review. I notice that this patch has already been in linux-next. Do I need to submit another version of this patch or provide a new patch to fix the problems you pointed out? Mike Kravetz <mike.kravetz@xxxxxxxxxx> 于2021年10月2日周六 上午6:33写道: > > On 9/27/21 3:41 AM, Zhenguo Yao wrote: > > We can specify the number of hugepages to allocate at boot. But the > > hugepages is balanced in all nodes at present. In some scenarios, > > we only need hugepages in one node. For example: DPDK needs hugepages > > which are in the same node as NIC. if DPDK needs four hugepages of 1G > > size in node1 and system has 16 numa nodes. We must reserve 64 hugepages > > in kernel cmdline. But, only four hugepages are used. The others should > > be free after boot. If the system memory is low(for example: 64G), it will > > be an impossible task. So, Extending hugepages parameter to support > > specifying hugepages at a specific node. > > For example add following parameter: > > > > hugepagesz=1G hugepages=0:1,1:3 > > > > It will allocate 1 hugepage in node0 and 3 hugepages in node1. > > > > Signed-off-by: Zhenguo Yao <yaozhenguo1@xxxxxxxxx> > > Thanks for your continued efforts! And, thank you Mike R. for your > input. > > Just a few very minor comments below. Everything else looks good. > With suggested updates, > > Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx> > > > --- > > v6->v7 changes: > > - replace nodes_weight(node_states[N_MEMORY] with nr_online_nodes. > > v5->v6 changes: > > - Remove v5 codes: using return value to disable node specific alloc. > > - Add node_specific_alloc_support weak function to disable node > > specific alloc when arch can't support it. > > - Remove useless variable addr in alloc_bootmem_huge_page. > > - Add powerpc version of node_specific_alloc_support when > > CONFIG_PPC_BOOK3S_64 is defined. > > v4->v5 changes: > > - remove BUG_ON in __alloc_bootmem_huge_page. > > - add nid parameter in __alloc_bootmem_huge_page to support > > called in both specific node alloc and normal alloc. > > - do normal alloc if architecture can't support node specific alloc. > > - return -1 in powerpc version of alloc_bootmem_huge_page when > > nid is not NUMA_NO_NODE. > > v3->v4 changes: > > - fix wrong behavior for parameter: > > hugepages=0:1,1:3 default_hugepagesz=1G > > - make the change of documentation more reasonable. > > v2->v3 changes: > > - Skip gigantic hugepages allocation if hugetlb_cma is enabled. > > - Fix wrong behavior for parameter: > > hugepagesz=2M hugepages=2 hugepages=5. > > - Update hugetlbpage.rst. > > - Fix side effects which v2 brings in. > > - add cond_resched in hugetlb_hstate_alloc_pages_onenode. > > v1->v2 changes: > > - add checking for max node to avoid array out of bounds. > > - fix wrong max_huge_pages after failed allocation. > > - fix wrong behavior when parsing parameter: hugepages=0:1,2,3:4. > > - return 0 when parsing invalid parameter in hugepages_setup > > --- > > .../admin-guide/kernel-parameters.txt | 8 +- > > Documentation/admin-guide/mm/hugetlbpage.rst | 12 +- > > arch/powerpc/mm/hugetlbpage.c | 9 +- > > include/linux/hugetlb.h | 6 +- > > mm/hugetlb.c | 153 +++++++++++++++--- > > 5 files changed, 157 insertions(+), 31 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > index 91ba391f9b32..9b3d8791586d 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -1599,9 +1599,11 @@ > > the number of pages of hugepagesz to be allocated. > > If this is the first HugeTLB parameter on the command > > line, it specifies the number of pages to allocate for > > - the default huge page size. See also > > - Documentation/admin-guide/mm/hugetlbpage.rst. > > - Format: <integer> > > + the default huge page size. If using node format, the > > + number of pages to allocate per-node can be specified. > > + See also Documentation/admin-guide/mm/hugetlbpage.rst. > > + Format: <integer> or (node format) > > + <node>:<integer>[,<node>:<integer>] > > > > hugepagesz= > > [HW] The size of the HugeTLB pages. This is used in > > diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst > > index 8abaeb144e44..d70828c07658 100644 > > --- a/Documentation/admin-guide/mm/hugetlbpage.rst > > +++ b/Documentation/admin-guide/mm/hugetlbpage.rst > > @@ -128,7 +128,9 @@ hugepages > > implicitly specifies the number of huge pages of default size to > > allocate. If the number of huge pages of default size is implicitly > > specified, it can not be overwritten by a hugepagesz,hugepages > > - parameter pair for the default size. > > + parameter pair for the default size. This parameter also has a > > + node format. The node format specifies the number of huge pages > > + to allocate on specific nodes. > > > > For example, on an architecture with 2M default huge page size:: > > > > @@ -138,6 +140,14 @@ hugepages > > indicating that the hugepages=512 parameter is ignored. If a hugepages > > parameter is preceded by an invalid hugepagesz parameter, it will > > be ignored. > > + > > + Node format example:: > > + > > + hugepagesz=2M hugepages=0:1,1:2 > > + > > + It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1. > > + If the node number is invalid, the parameter will be ignored. > > + > > default_hugepagesz > > Specify the default huge page size. This parameter can > > only be specified once on the command line. default_hugepagesz can > > diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c > > index 9a75ba078e1b..dd40ce6e7565 100644 > > --- a/arch/powerpc/mm/hugetlbpage.c > > +++ b/arch/powerpc/mm/hugetlbpage.c > > @@ -229,17 +229,22 @@ static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate) > > m->hstate = hstate; > > return 1; > > } > > + > > +bool __init node_specific_alloc_support(void) > > +{ > > + return false; > > +} > > #endif > > > > > > -int __init alloc_bootmem_huge_page(struct hstate *h) > > +int __init alloc_bootmem_huge_page(struct hstate *h, int nid) > > { > > > > #ifdef CONFIG_PPC_BOOK3S_64 > > if (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled()) > > return pseries_alloc_bootmem_huge_page(h); > > #endif > > - return __alloc_bootmem_huge_page(h); > > + return __alloc_bootmem_huge_page(h, nid); > > } > > > > #ifndef CONFIG_PPC_BOOK3S_64 > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > > index 1faebe1cd0ed..3504e407567c 100644 > > --- a/include/linux/hugetlb.h > > +++ b/include/linux/hugetlb.h > > @@ -605,6 +605,7 @@ struct hstate { > > unsigned long nr_overcommit_huge_pages; > > struct list_head hugepage_activelist; > > struct list_head hugepage_freelists[MAX_NUMNODES]; > > + unsigned int max_huge_pages_node[MAX_NUMNODES]; > > unsigned int nr_huge_pages_node[MAX_NUMNODES]; > > unsigned int free_huge_pages_node[MAX_NUMNODES]; > > unsigned int surplus_huge_pages_node[MAX_NUMNODES]; > > @@ -637,8 +638,9 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, > > unsigned long address, struct page *page); > > > > /* arch callback */ > > -int __init __alloc_bootmem_huge_page(struct hstate *h); > > -int __init alloc_bootmem_huge_page(struct hstate *h); > > +int __init __alloc_bootmem_huge_page(struct hstate *h, int nid); > > +int __init alloc_bootmem_huge_page(struct hstate *h, int nid); > > +bool __init node_specific_alloc_support(void); > > Agree with Mike R. that hugetlb should be in the name. > > > > > void __init hugetlb_add_hstate(unsigned order); > > bool __init arch_hugetlb_valid_size(unsigned long size); > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 95dc7b83381f..ca00676a1bdd 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -66,6 +66,7 @@ static struct hstate * __initdata parsed_hstate; > > static unsigned long __initdata default_hstate_max_huge_pages; > > static bool __initdata parsed_valid_hugepagesz = true; > > static bool __initdata parsed_default_hugepagesz; > > +static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata; > > > > /* > > * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages, > > @@ -2868,33 +2869,41 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, > > return ERR_PTR(-ENOSPC); > > } > > > > -int alloc_bootmem_huge_page(struct hstate *h) > > +int alloc_bootmem_huge_page(struct hstate *h, int nid) > > __attribute__ ((weak, alias("__alloc_bootmem_huge_page"))); > > -int __alloc_bootmem_huge_page(struct hstate *h) > > +int __alloc_bootmem_huge_page(struct hstate *h, int nid) > > { > > struct huge_bootmem_page *m; > > int nr_nodes, node; > > > > + if (nid >= nr_online_nodes) > > + return 0; > > + /* do node specific alloc */ > > + if (nid != NUMA_NO_NODE) { > > + m = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h), > > + 0, MEMBLOCK_ALLOC_ACCESSIBLE, nid); > > + if (m) > > + goto found; > > + else > > + return 0; > > + } > > + /* do all node balanced alloc */ > > I'm sure you saw Smatch does not like this comment. Perhaps, > /* allocate from next node when distributing huge pages */ > > > for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) { > > - void *addr; > > - > > - addr = memblock_alloc_try_nid_raw( > > + m = memblock_alloc_try_nid_raw( > > huge_page_size(h), huge_page_size(h), > > 0, MEMBLOCK_ALLOC_ACCESSIBLE, node); > > - if (addr) { > > - /* > > - * Use the beginning of the huge page to store the > > - * huge_bootmem_page struct (until gather_bootmem > > - * puts them into the mem_map). > > - */ > > - m = addr; > > + /* > > + * Use the beginning of the huge page to store the > > + * huge_bootmem_page struct (until gather_bootmem > > + * puts them into the mem_map). > > + */ > > + if (m) > > goto found; > > - } > > + else > > + return 0; > > } > > - return 0; > > > > found: > > - BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h))); > > /* Put them into a private list first because mem_map is not up yet */ > > INIT_LIST_HEAD(&m->list); > > list_add(&m->list, &huge_boot_pages); > > @@ -2934,12 +2943,61 @@ static void __init gather_bootmem_prealloc(void) > > cond_resched(); > > } > > } > > +static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid) > > +{ > > + unsigned long i; > > + char buf[32]; > > + > > + for (i = 0; i < h->max_huge_pages_node[nid]; ++i) { > > + if (hstate_is_gigantic(h)) { > > + if (!alloc_bootmem_huge_page(h, nid)) > > + break; > > + } else { > > + struct page *page; > > + gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; > > + > > + page = alloc_fresh_huge_page(h, gfp_mask, nid, > > + &node_states[N_MEMORY], NULL); > > + if (!page) > > + break; > > + put_page(page); /* free it into the hugepage allocator */ > > + } > > + cond_resched(); > > + } > > + if (i == h->max_huge_pages_node[nid]) > > + return; > > + > > + string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32); > > + pr_warn("HugeTLB: allocating %u of page size %s failed node%d. Only allocated %lu hugepages.\n", > > + h->max_huge_pages_node[nid], buf, nid, i); > > + h->max_huge_pages -= (h->max_huge_pages_node[nid] - i); > > + h->max_huge_pages_node[nid] = i; > > +} > > > > static void __init hugetlb_hstate_alloc_pages(struct hstate *h) > > { > > unsigned long i; > > nodemask_t *node_alloc_noretry; > > + bool node_specific_alloc = false; > > + > > + /* skip gigantic hugepages allocation if hugetlb_cma enabled */ > > + if (hstate_is_gigantic(h) && hugetlb_cma_size) { > > + pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n"); > > + return; > > + } > > + > > + /* do node specific alloc */ > > + for (i = 0; i < nr_online_nodes; i++) { > > + if (h->max_huge_pages_node[i] > 0) { > > + hugetlb_hstate_alloc_pages_onenode(h, i); > > + node_specific_alloc = true; > > + } > > + } > > > > + if (node_specific_alloc) > > + return; > > + > > + /* bellow will do all node balanced alloc */ > > below > > > if (!hstate_is_gigantic(h)) { > > /* > > * Bit mask controlling how hard we retry per-node allocations. > > @@ -2960,11 +3018,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h) > > > > for (i = 0; i < h->max_huge_pages; ++i) { > > if (hstate_is_gigantic(h)) { > > - if (hugetlb_cma_size) { > > - pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n"); > > - goto free; > > - } > > - if (!alloc_bootmem_huge_page(h)) > > + if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE)) > > break; > > } else if (!alloc_pool_huge_page(h, > > &node_states[N_MEMORY], > > @@ -2980,7 +3034,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h) > > h->max_huge_pages, buf, i); > > h->max_huge_pages = i; > > } > > -free: > > kfree(node_alloc_noretry); > > } > > > > @@ -3671,6 +3724,11 @@ static int __init hugetlb_init(void) > > > default_hstate.max_huge_pages = > > default_hstate_max_huge_pages; > > + > > + for (i = 0; i < nr_online_nodes; i++) > > + if (default_hugepages_in_node[i] > 0) > > + default_hstate.max_huge_pages_node[i] = > > + default_hugepages_in_node[i]; > > Very minor nit. Not necessary to change. > The 'if (default_hugepages_in_node[i] > 0)' is not really needed. I am > not sure if any potential speed up by the check would be noticable. You > need to 'read' every value in the array. > > > } > > } > > > > @@ -3731,6 +3789,10 @@ void __init hugetlb_add_hstate(unsigned int order) > > parsed_hstate = h; > > } > > > > +bool __init __weak node_specific_alloc_support(void) > > +{ > > + return true; > > +} > > /* > > * hugepages command line processing > > * hugepages normally follows a valid hugepagsz or default_hugepagsz > > @@ -3742,6 +3804,10 @@ static int __init hugepages_setup(char *s) > > { > > unsigned long *mhp; > > static unsigned long *last_mhp; > > + int node = NUMA_NO_NODE; > > + int count; > > + unsigned long tmp; > > + char *p = s; > > > > if (!parsed_valid_hugepagesz) { > > pr_warn("HugeTLB: hugepages=%s does not follow a valid hugepagesz, ignoring\n", s); > > @@ -3765,8 +3831,40 @@ static int __init hugepages_setup(char *s) > > return 0; > > } > > > > - if (sscanf(s, "%lu", mhp) <= 0) > > - *mhp = 0; > > + while (*p) { > > + count = 0; > > + if (sscanf(p, "%lu%n", &tmp, &count) != 1) > > + goto invalid; > > + /* Parameter is node format */ > > + if (p[count] == ':') { > > + if (!node_specific_alloc_support()) { > > + pr_warn("HugeTLB: architecture can't support node specific alloc, ignoring!\n"); > > + return 0; > > + } > > + node = tmp; > > + p += count + 1; > > + if (node < 0 || node >= nr_online_nodes) > > + goto invalid; > > + /* Parse hugepages */ > > + if (sscanf(p, "%lu%n", &tmp, &count) != 1) > > + goto invalid; > > + if (!hugetlb_max_hstate) > > + default_hugepages_in_node[node] = tmp; > > + else > > + parsed_hstate->max_huge_pages_node[node] = tmp; > > + *mhp += tmp; > > + /* Go to parse next node*/ > > + if (p[count] == ',') > > + p += count + 1; > > + else > > + break; > > + } else { > > + if (p != s) > > + goto invalid; > > + *mhp = tmp; > > + break; > > + } > > + } > > > > /* > > * Global state is always initialized later in hugetlb_init. > > @@ -3779,6 +3877,10 @@ static int __init hugepages_setup(char *s) > > last_mhp = mhp; > > > > return 1; > > + > > +invalid: > > + pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p); > > + return 0; > > } > > __setup("hugepages=", hugepages_setup); > > > > @@ -3840,6 +3942,7 @@ __setup("hugepagesz=", hugepagesz_setup); > > static int __init default_hugepagesz_setup(char *s) > > { > > unsigned long size; > > + int i; > > > > parsed_valid_hugepagesz = false; > > if (parsed_default_hugepagesz) { > > @@ -3868,6 +3971,10 @@ static int __init default_hugepagesz_setup(char *s) > > */ > > if (default_hstate_max_huge_pages) { > > default_hstate.max_huge_pages = default_hstate_max_huge_pages; > > + for (i = 0; i < nr_online_nodes; i++) > > + if (default_hugepages_in_node[i] > 0) > > + default_hstate.max_huge_pages_node[i] = > > + default_hugepages_in_node[i]; > > Same minor nit as in hugetlb_init. > > > if (hstate_is_gigantic(&default_hstate)) > > hugetlb_hstate_alloc_pages(&default_hstate); > > default_hstate_max_huge_pages = 0; > > > > -- > Mike Kravetz