On 20-06-24 09:52:16, Michal Hocko wrote:
> On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > On 20-06-23 13:20:48, Michal Hocko wrote:
> [...]
> > > It would be also great to provide a high level semantic description
> > > here. I have very quickly glanced through patches and they are not
> > > really trivial to follow with many incremental steps so the higher level
> > > intention is lost easily.
> > >
> > > Do I get it right that the default semantic is essentially
> > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > 	  semantic)
> > > 	- fallback to numa unrestricted allocation with the default
> > > 	  numa policy on the failure
> > >
> > > Or are there any usecases to modify how hard to keep the preference over
> > > the fallback?
> >
> > tl;dr is: yes, and no usecases.
>
> OK, then I am wondering why the change has to be so involved. Except for
> syscall plumbing the only real change to the allocator path would be
> something like
>
> static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> {
> 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> 	if (unlikely(policy->mode == MPOL_BIND ||
> 		     policy->mode == MPOL_PREFERRED_MANY) &&
> 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> 		return &policy->v.nodes;
>
> 	return NULL;
> }
>
> alloc_pages_current
>
> 	if (pol->mode == MPOL_INTERLEAVE)
> 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> 	else {
> 		gfp_t gfp_attempt = gfp;
>
> 		/*
> 		 * Make sure the first allocation attempt will try hard
> 		 * but eventually fail without OOM killer or other
> 		 * disruption before falling back to the full nodemask
> 		 */
> 		if (pol->mode == MPOL_PREFERRED_MANY)
> 			gfp_attempt |= __GFP_RETRY_MAYFAIL;
>
> 		page = __alloc_pages_nodemask(gfp_attempt, order,
> 				policy_node(gfp, pol, numa_node_id()),
> 				policy_nodemask(gfp, pol));
> 		if (!page && pol->mode == MPOL_PREFERRED_MANY)
> 			page = __alloc_pages_nodemask(gfp, order,
> 					numa_node_id(), NULL);
> 	}
>
> 	return page;
>
> similar (well slightly more hairy) in alloc_pages_vma
>
> Or do I miss something that really requires more involved approach like
> building custom zonelists and other larger changes to the allocator?

Hi Michal,

I'm mostly done implementing this change. It looks good, and so far I think
it's functionally equivalent.

One thing though: above you use NULL for the fallback. That actually should
not be NULL, because of the logic in policy_node to restrict zones and to obey
cpusets. I've implemented it that way, but I was hoping someone with a deeper
understanding and more experience can confirm it was the correct thing to do.

Thanks.