On Wed 03-08-22 14:41:20, Feng Tang wrote:
> On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote:
> > Please make sure to CC Mike on hugetlb related changes.
>
> OK.
>
> > I didn't really get to grasp your proposed solution but it feels like
> > it is going sideways. The real issue is that hugetlb uses a dedicated
> > allocation scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS.
> > I do not think we should be tricking that by providing some fake
> > nodemasks and what not.
> >
> > The good news is that allocation from the pool is MPOL_PREFERRED_MANY
> > aware because it first tries to allocate from the preferred node mask
> > and then falls back to the full nodemask (dequeue_huge_page_vma).
> > If the existing pools cannot really satisfy that allocation then it
> > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also
> > performs a 2 stage allocation, first with the node mask and then with
> > no node mask. But both of them might fail.
> >
> > The bad news is that other allocation functions - including those that
> > allocate to the pool - are not fully MPOL_PREFERRED_MANY aware. E.g.
> > __nr_hugepages_store_common paths which use the allocating process
> > policy to fill up the pool, so the pool could be under provisioned if
> > that context is using MPOL_PREFERRED_MANY.
>
> Thanks for the check!
>
> So you mean if the preferred nodes don't have enough pages, we should
> also fall back to all nodes like dequeue_huge_page_vma() does?
>
> Or we can use a policy API which returns a nodemask for MPOL_BIND and
> NULL for all other policies, like allowed_mems_nr() needs.
>
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
>  	return policy_nodemask(gfp, mpol);
>  }
>
> +#ifdef CONFIG_HUGETLB_FS
> +static inline nodemask_t *strict_policy_nodemask_current(void)
> +{
> +	struct mempolicy *mpol = get_task_policy(current);
> +
> +	if (mpol->mode == MPOL_BIND)
> +		return &mpol->nodes;
> +
> +	return NULL;
> +}
> +#endif

Yes, something like this, except that I would also move this into hugetlb
proper because this doesn't seem generally useful.

> > Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is reservation
> > code and I have to admit I do not really remember the details there.
> > This is subtle code and my best guess would be that
> > policy_nodemask_current should be hugetlb specific and only care about
> > MPOL_BIND.
>
> The API needed by allowed_mems_nr() is a little different as it has gfp
> flag and cpuset config to consider.

Why would gfp mask matter?

-- 
Michal Hocko
SUSE Labs
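
For readers following the thread, here is a minimal user-space sketch of the
"nodemask for MPOL_BIND, NULL for everything else" idea from the quoted patch.
It is not kernel code: every sketch_* name, the 64-bit mask type and the
counting helper are assumptions made up for illustration, and the way the
caller treats NULL as "all nodes" is only one plausible reading of what a
hugetlb-side user might do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for the kernel types; all names are hypothetical. */
enum sketch_mpol_mode {
	SKETCH_MPOL_DEFAULT,
	SKETCH_MPOL_PREFERRED,
	SKETCH_MPOL_BIND,
	SKETCH_MPOL_PREFERRED_MANY,
};

/* Bit n set means node n is part of the mask (at most 64 nodes here). */
typedef uint64_t sketch_nodemask_t;

struct sketch_mempolicy {
	enum sketch_mpol_mode mode;
	sketch_nodemask_t nodes;
};

/*
 * Return the policy's nodemask only when the policy is strict (BIND).
 * Every other mode, including PREFERRED_MANY, yields NULL, which the
 * caller is expected to read as "no hard restriction".
 */
static const sketch_nodemask_t *
sketch_strict_nodemask(const struct sketch_mempolicy *pol)
{
	assert(pol != NULL);

	if (pol->mode == SKETCH_MPOL_BIND)
		return &pol->nodes;

	return NULL;
}

/*
 * Hypothetical consumer: sum per-node free page counts over the nodes
 * allowed by the mask. A NULL mask counts every node.
 */
static unsigned long
sketch_allowed_free_pages(const unsigned long *free_pages,
			  unsigned int nr_nodes,
			  const sketch_nodemask_t *mask)
{
	unsigned long total = 0;
	unsigned int nid;

	for (nid = 0; nid < nr_nodes && nid < 64; nid++) {
		if (mask == NULL || (*mask & (UINT64_C(1) << nid)))
			total += free_pages[nid];
	}

	return total;
}
```

With a BIND policy covering nodes 0 and 2, only those two nodes contribute to
the count. With PREFERRED_MANY over the same nodes the helper returns NULL, so
the preference does not shrink what the accounting sees.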