On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote:
> Please make sure to CC Mike on hugetlb related changes.

OK.

> I didn't really get to grasp your proposed solution but it feels going
> sideways. The real issue is that hugetlb uses a dedicated allocation
> scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not
> think we should be tricking that by providing some fake nodemasks and
> what not.
>
> The good news is that allocation from the pool is MPOL_PREFERRED_MANY
> aware because it first tries to allocate from the preferred node mask
> and then fall back to the full nodemask (dequeue_huge_page_vma).
> If the existing pools cannot really satisfy that allocation then it
> tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also
> performs 2 stage allocation with the node mask and no node masks. But
> both of them might fail.
>
> The bad news is that other allocation functions - including those that
> allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g.
> __nr_hugepages_store_common paths which use the allocating process
> policy to fill up the pool so the pool could be under provisioned if
> that context is using MPOL_PREFERRED_MANY.

Thanks for the check!

So you mean that if the preferred nodes don't have enough pages, we should
also fall back to all nodes, like dequeue_huge_page_vma() does?

Or we can use a policy API which returns the nodemask for MPOL_BIND and
NULL for all other policies, like allowed_mems_nr() needs.

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
 	return policy_nodemask(gfp, mpol);
 }
 
+#ifdef CONFIG_HUGETLB_FS
+static inline nodemask_t *strict_policy_nodemask_current(void)
+{
+	struct mempolicy *mpol = get_task_policy(current);
+
+	if (mpol->mode == MPOL_BIND)
+		return &mpol->nodes;
+
+	return NULL;
+}
+#endif
+

> Wrt. allowed_mems_nr (i.e.
> hugetlb_acct_memory) this is a reservation
> code and I have to admit I do not really remember details there. This is
> a subtle code and my best guess would be that policy_nodemask_current
> should be hugetlb specific and only care about MPOL_BIND.

The API needed by allowed_mems_nr() is a little different, as it has the
gfp flag and cpuset config to consider.

Thanks,
Feng

[snip]
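
To make the two behaviours discussed above concrete, here is a minimal
userspace sketch. It is not kernel code: every sketch_* name, the 8-node
limit and the bitmask representation are assumptions made up for
illustration. It models (a) the two-stage dequeue for a PREFERRED_MANY-like
policy (preferred mask first, then all nodes) and (b) a "strict" nodemask
that is only non-NULL for a BIND-like policy, used to count the pages a
reservation could rely on. The gfp flag and cpuset handling that
allowed_mems_nr() has to consider are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define SKETCH_MAX_NODES 8
#define SKETCH_ALL_NODES ((1u << SKETCH_MAX_NODES) - 1)

enum sketch_mode { SKETCH_DEFAULT, SKETCH_BIND, SKETCH_PREFERRED_MANY };

struct sketch_policy {
	enum sketch_mode mode;
	uint32_t nodes;		/* bit n set => node n is in the policy mask */
};

struct sketch_pool {
	unsigned int free_pages[SKETCH_MAX_NODES];
};

/* Take one free page from the lowest-numbered node in mask; -1 if none. */
static int sketch_dequeue(struct sketch_pool *pool, uint32_t mask)
{
	for (int nid = 0; nid < SKETCH_MAX_NODES; nid++) {
		if ((mask & (1u << nid)) && pool->free_pages[nid] > 0) {
			pool->free_pages[nid]--;
			return nid;
		}
	}
	return -1;
}

/* Only a BIND-like policy yields a hard constraint; otherwise NULL. */
static const uint32_t *sketch_strict_nodemask(const struct sketch_policy *pol)
{
	return pol->mode == SKETCH_BIND ? &pol->nodes : NULL;
}

/* Two-stage allocation: preferred nodes first, then any node. */
static int sketch_alloc(struct sketch_pool *pool,
			const struct sketch_policy *pol)
{
	int nid;

	switch (pol->mode) {
	case SKETCH_BIND:
		/* Hard constraint: never leave the bound nodes. */
		return sketch_dequeue(pool, pol->nodes);
	case SKETCH_PREFERRED_MANY:
		nid = sketch_dequeue(pool, pol->nodes);
		return nid >= 0 ? nid : sketch_dequeue(pool, SKETCH_ALL_NODES);
	default:
		return sketch_dequeue(pool, SKETCH_ALL_NODES);
	}
}

/* Count free pages a reservation may use: strict mask, else all nodes. */
static unsigned int sketch_allowed_pages(const struct sketch_pool *pool,
					 const struct sketch_policy *pol)
{
	const uint32_t *strict = sketch_strict_nodemask(pol);
	uint32_t mask = strict ? *strict : SKETCH_ALL_NODES;
	unsigned int total = 0;

	for (int nid = 0; nid < SKETCH_MAX_NODES; nid++)
		if (mask & (1u << nid))
			total += pool->free_pages[nid];
	return total;
}
```

With free pages only on nodes 1 and 3 and a policy mask covering just
node 0, the PREFERRED_MANY-like policy still gets a page from node 1,
while the BIND-like policy fails, and only the BIND-like policy restricts
the count returned by sketch_allowed_pages().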