On Wed, Aug 03, 2022 at 03:36:41PM +0800, Michal Hocko wrote: > On Wed 03-08-22 14:41:20, Feng Tang wrote: > > On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote: > > > Please make sure to CC Mike on hugetlb related changes. > > > > OK. > > > > > I didn't really get to grasp your proposed solution but it feels goind > > > sideways. The real issue is that hugetlb uses a dedicated allocation > > > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not > > > think we should be tricking that by providing some fake nodemasks and > > > what not. > > > > > > The good news is that allocation from the pool is MPOL_PREFERRED_MANY > > > aware because it first tries to allocation from the preffered node mask > > > and then fall back to the full nodemask (dequeue_huge_page_vma). > > > If the existing pools cannot really satisfy that allocation then it > > > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also > > > performs 2 stage allocation with the node mask and no node masks. But > > > both of them might fail. > > > > > > The bad news is that other allocation functions - including those that > > > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g. > > > __nr_hugepages_store_common paths which use the allocating process > > > policy to fill up the pool so the pool could be under provisioned if > > > that context is using MPOL_PREFERRED_MANY. > > > > Thanks for the check! > > > > So you mean if the prferred nodes don't have enough pages, we should > > also fallback to all like dequeue_huge_page_vma() does? > > > > Or we can user a policy API which return nodemask for MPOL_BIND and > > NULL for all other policies, like allowed_mems_nr() needs. > > > > --- a/include/linux/mempolicy.h > > +++ b/include/linux/mempolicy.h > > @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > > return policy_nodemask(gfp, mpol); > > } > > > > +#ifdef CONFIG_HUGETLB_FS > > +static inline nodemask_t *strict_policy_nodemask_current(void) > > +{ > > + struct mempolicy *mpol = get_task_policy(current); > > + > > + if (mpol->mode == MPOL_BIND) > > + return &mpol->nodes; > > + > > + return NULL; > > +} > > +#endif > > Yes something like this, except that I would also move this into hugetlb > proper because this doesn't seem generally useful. Ok, I change it as below: --- mm/hugetlb.c | 28 +++++++++++++++++++++++----- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 14be38822cf8..ef1d4ffa733f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -91,6 +91,24 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp; /* Forward declaration */ static int hugetlb_acct_memory(struct hstate *h, long delta); +/* + * Return nodemask of what is allowed by current process' memory + * policy, as MPOL_BIND is the only 'strict' policy, return NULL + * for all other policies + */ +static inline nodemask_t *allowed_policy_nodemask_current(void) +{ +#ifdef CONFIG_NUMA + struct mempolicy *mpol = get_task_policy(current); + + if (mpol->mode == MPOL_BIND) + return &mpol->nodes; + return NULL; +#else + return NULL; +#endif +} + static inline bool subpool_is_free(struct hugepage_subpool *spool) { if (spool->count) @@ -3556,7 +3574,7 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, unsigned long count, size_t len) { int err; - nodemask_t nodes_allowed, *n_mask; + nodemask_t nodes_allowed, *n_mask = NULL; if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) return -EINVAL; @@ -3565,11 +3583,11 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, /* * global hstate attribute */ - if (!(obey_mempolicy && - init_nodemask_of_mempolicy(&nodes_allowed))) + if (obey_mempolicy) + n_mask = allowed_policy_nodemask_current(); + + if (!n_mask) n_mask = &node_states[N_MEMORY]; - else - n_mask = &nodes_allowed; } else { /* * Node specific request. count adjustment happens in -- 2.27.0 > > > Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation > > > code and I have to admit I do not really remember details there. This is > > > a subtle code and my best guess would be that policy_nodemask_current > > > should be hugetlb specific and only care about MPOL_BIND. > > > > The API needed by allowed_mem_nr() is a little different as it has gfp > > flag and cpuset config to consider. > > Why would gfp mask matter? I'm not very familiar with the old semantics (will check more), from current code, it checks both the gfp flags and cpuset limit. Thanks, Feng > -- > Michal Hocko > SUSE Labs