On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> On 20-06-24 09:52:16, Michal Hocko wrote:
> > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > [...]
> > > > It would be also great to provide a high level semantic description
> > > > here. I have very quickly glanced through patches and they are not
> > > > really trivial to follow with many incremental steps so the higher level
> > > > intention is lost easily.
> > > >
> > > > Do I get it right that the default semantic is essentially
> > > >   - allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > >     semantic)
> > > >   - fallback to numa unrestricted allocation with the default
> > > >     numa policy on the failure
> > > >
> > > > Or are there any usecases to modify how hard to keep the preference over
> > > > the fallback?
> > >
> > > tl;dr is: yes, and no usecases.
> >
> > OK, then I am wondering why the change has to be so involved. Except for
> > syscall plumbing the only real change to the allocator path would be
> > something like
> >
> > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > {
> >         /* Lower zones don't get a nodemask applied for MPOL_BIND */
> >         if (unlikely(policy->mode == MPOL_BIND ||
> >                      policy->mode == MPOL_PREFERED_MANY) &&
> >                         apply_policy_zone(policy, gfp_zone(gfp)) &&
> >                         cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> >                 return &policy->v.nodes;
> >
> >         return NULL;
> > }
> >
> > alloc_pages_current
> >
> >         if (pol->mode == MPOL_INTERLEAVE)
> >                 page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> >         else {
> >                 gfp_t gfp_attempt = gfp;
> >
> >                 /*
> >                  * Make sure the first allocation attempt will try hard
> >                  * but eventually fail without OOM killer or other
> >                  * disruption before falling back to the full nodemask
> >                  */
> >                 if (pol->mode == MPOL_PREFERED_MANY)
> >                         gfp_attempt |= __GFP_RETRY_MAYFAIL;
> >
> >                 page = __alloc_pages_nodemask(gfp_attempt, order,
> >                                 policy_node(gfp, pol, numa_node_id()),
> >                                 policy_nodemask(gfp, pol));
> >                 if (!page && pol->mode == MPOL_PREFERED_MANY)
> >                         page = __alloc_pages_nodemask(gfp, order,
> >                                 numa_node_id(), NULL);
> >         }
> >
> >         return page;
> >
> > similar (well slightly more hairy) in alloc_pages_vma
> >
> > Or do I miss something that really requires more involved approach like
> > building custom zonelists and other larger changes to the allocator?
>
> I think I'm missing how this allows selecting from multiple preferred nodes. In
> this case when you try to get the page from the freelist, you'll get the
> zonelist of the preferred node, and when you actually scan through on page
> allocation, you have no way to filter out the non-preferred nodes. I think the
> plumbing of multiple nodes has to go all the way through
> __alloc_pages_nodemask(). But it's possible I've missed the point.

policy_nodemask() will provide the nodemask which will be used as a filter
on the policy_node.
-- 
Michal Hocko
SUSE Labs
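
To make the filtering point concrete, here is a minimal userspace sketch of the two-step scheme discussed above. It is a toy model and not kernel code: free_pages[], toy_alloc() and toy_alloc_preferred_many() are hypothetical names made up for illustration, a plain bitmask stands in for nodemask_t, and a simple round-robin walk stands in for the zonelist of the starting node. The only thing it is meant to show is that walking the starting node's fallback order while skipping nodes outside the mask already restricts the first attempt to the preferred set, and that a second attempt with a NULL mask gives the unrestricted fallback.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4

/* Toy state (assumption): number of free pages on each NUMA node. */
static int free_pages[MAX_NODES];

/*
 * Walk the nodes in fallback order starting at 'start' (a stand-in for the
 * zonelist of the starting node) and take a page from the first node that
 * passes the filter. A NULL mask means no restriction.
 * Returns the node id the page came from, or -1 on failure.
 */
static int toy_alloc(int start, const unsigned int *mask)
{
        assert(start >= 0 && start < MAX_NODES);

        for (int i = 0; i < MAX_NODES; i++) {
                int nid = (start + i) % MAX_NODES;

                /* The mask acts purely as a filter on the walk. */
                if (mask != NULL && !(*mask & (1u << nid)))
                        continue;
                if (free_pages[nid] > 0) {
                        free_pages[nid]--;
                        return nid;
                }
        }
        return -1;
}

/*
 * Two-step scheme: first try only the preferred nodes, then fall back to an
 * unrestricted attempt from the local node.
 */
static int toy_alloc_preferred_many(int local, unsigned int preferred)
{
        int nid = toy_alloc(local, &preferred);

        if (nid < 0)
                nid = toy_alloc(local, NULL);
        return nid;
}
```

With one free page on each of nodes 0, 2 and 3, a local node of 0 and a preferred mask covering nodes 2 and 3, successive calls to toy_alloc_preferred_many() take node 2, then node 3, then fall back to node 0, and finally fail once everything is empty. The model ignores zones, cpusets and the retry behaviour implied by __GFP_RETRY_MAYFAIL.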