On Mon 31-08-20 14:44:40, Mike Kravetz wrote:
> On 8/30/20 7:04 AM, Li Xinhai wrote:
> > Since commit cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic
> > hugepages using cma"), the gigantic page would be allocated from a node
> > which is not the preferred node, although there are pages available from
> > that node. The reason is that the nid parameter has been ignored in
> > alloc_gigantic_page().
> >
> > After this patch, the preferred node is tried first before other allowed
> > nodes.
>
> Thank you!
> This is an issue that needs to be fixed.
>
> > Fixes: cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
> > Cc: Roman Gushchin <guro@xxxxxx>
> > Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> > Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> > Signed-off-by: Li Xinhai <lixinhai.lxh@xxxxxxxxx>
> > ---
> >  mm/hugetlb.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a301c2d672bf..4a28b8853d47 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1256,8 +1256,15 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> >  		struct page *page;
> >  		int node;
> >
> > +		if (hugetlb_cma[nid]) {
> > +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
> > +					huge_page_order(h), true);
> > +			if (page)
> > +				return page;
> > +		}
> > +
>
> When looking at your changes, I noticed that this code for allocation
> from CMA does not take gfp_mask into account. The 'normal' use case
> is to allocate pool pages with something similar to:
>
> echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>
> The routine alloc_pool_huge_page will try to interleave pages among nodes:
>
> ...
> 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>
> 	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> ...
>
> which will eventually call alloc_gigantic_page. If __GFP_THISNODE is
> set we really do not want to execute the below for loop in
> alloc_gigantic_page.

Yes, this is indeed the case.

> I think the convention in the mm code is that only the lowest level
> allocation routines should interpret the GFP flags. We may need to make
> an exception here and check for __GFP_THISNODE.

Yes, this is true, but alloc_gigantic_page is in fact a low-level
allocation routine. I would go with the following:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..124754240b56 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1256,6 +1256,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		struct page *page;
 		int node;
 
+		if (nid != NUMA_NO_NODE && hugetlb_cma[nid]) {
+			page = cma_alloc(hugetlb_cma[nid], nr_pages,
+					huge_page_order(h), true);
+			if (page)
+				return page;
+		}
+
+		if (gfp_mask & __GFP_THISNODE)
+			return NULL;
+
 		for_each_node_mask(node, *nodemask) {
 			if (!hugetlb_cma[node])
 				continue;

I do not think we actually have an explicit NUMA_NO_NODE user, but it is
safer not to assume that here.
-- 
Michal Hocko
SUSE Labs
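
For reference, a sketch of how the CONFIG_CMA portion of alloc_gigantic_page()
would read with the diff above applied. The surrounding lines (the nr_pages
computation, the CONFIG_CMA block and the final alloc_contig_pages() fallback)
are assumed from the kernel tree the diff is against and are not part of the
message itself:

static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
		int nid, nodemask_t *nodemask)
{
	unsigned long nr_pages = 1UL << huge_page_order(h);

#ifdef CONFIG_CMA
	{
		struct page *page;
		int node;

		/* Try the CMA area of the preferred node first. */
		if (nid != NUMA_NO_NODE && hugetlb_cma[nid]) {
			page = cma_alloc(hugetlb_cma[nid], nr_pages,
					 huge_page_order(h), true);
			if (page)
				return page;
		}

		/*
		 * The caller asked for this node only; do not fall back to
		 * other nodes' CMA areas (the proposed diff returns NULL
		 * here rather than falling through).
		 */
		if (gfp_mask & __GFP_THISNODE)
			return NULL;

		/* Otherwise try any other allowed node that has a CMA area. */
		for_each_node_mask(node, *nodemask) {
			if (!hugetlb_cma[node])
				continue;

			page = cma_alloc(hugetlb_cma[node], nr_pages,
					 huge_page_order(h), true);
			if (page)
				return page;
		}
	}
#endif

	return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
}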