On 20-06-24 09:52:16, Michal Hocko wrote:
> On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > On 20-06-23 13:20:48, Michal Hocko wrote:
> [...]
> > > It would be also great to provide a high level semantic description
> > > here. I have very quickly glanced through patches and they are not
> > > really trivial to follow with many incremental steps so the higher level
> > > intention is lost easily.
> > >
> > > Do I get it right that the default semantic is essentially
> > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > 	  semantic)
> > > 	- fallback to numa unrestricted allocation with the default
> > > 	  numa policy on the failure
> > >
> > > Or are there any usecases to modify how hard to keep the preference over
> > > the fallback?
> >
> > tl;dr is: yes, and no usecases.
>
> OK, then I am wondering why the change has to be so involved. Except for
> syscall plumbing the only real change to the allocator path would be
> something like
>
> static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> {
> 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> 	if (unlikely(policy->mode == MPOL_BIND ||
> 		     policy->mode == MPOL_PREFERED_MANY) &&
> 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> 		return &policy->v.nodes;
>
> 	return NULL;
> }
>
> alloc_pages_current
>
> 	if (pol->mode == MPOL_INTERLEAVE)
> 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> 	else {
> 		gfp_t gfp_attempt = gfp;
>
> 		/*
> 		 * Make sure the first allocation attempt will try hard
> 		 * but eventually fail without OOM killer or other
> 		 * disruption before falling back to the full nodemask
> 		 */
> 		if (pol->mode == MPOL_PREFERED_MANY)
> 			gfp_attempt |= __GFP_RETRY_MAYFAIL;
>
> 		page = __alloc_pages_nodemask(gfp_attempt, order,
> 				policy_node(gfp, pol, numa_node_id()),
> 				policy_nodemask(gfp, pol));
> 		if (!page && pol->mode == MPOL_PREFERED_MANY)
> 			page = __alloc_pages_nodemask(gfp, order,
> 				numa_node_id(), NULL);
> 	}
>
> 	return page;
>
> similar (well slightly more hairy) in alloc_pages_vma
>
> Or do I miss something that really requires more involved approach like
> building custom zonelists and other larger changes to the allocator?

I think I'm missing how this allows selecting from multiple preferred
nodes. In this case when you try to get the page from the freelist,
you'll get the zonelist of the preferred node, and when you actually
scan through on page allocation, you have no way to filter out the
non-preferred nodes. I think the plumbing of multiple nodes has to go
all the way through __alloc_pages_nodemask(). But it's possible I've
missed the point.

I do have a branch where I build a custom zonelist, but that's not the
reason here :-)
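
To make the question concrete, here is a toy userspace model of the
two-pass scheme as I read the proposal: a first attempt restricted to the
preferred nodemask, then an unrestricted retry starting from the local
node. None of this is kernel code. toy_nodemask, toy_alloc_nodemask(),
toy_alloc_preferred_many(), toy_demo() and the free_pages[] state are all
made up for illustration, and the wrap-around node walk is only a crude
stand-in for a real zonelist. Whether the real __alloc_pages_nodemask()
path filters like this in the multi-node case is exactly the open
question above.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4

/* Hypothetical stand-in for nodemask_t: bit n set means node n is allowed. */
typedef unsigned int toy_nodemask;

/* Hypothetical per-node free page counters, only for this model. */
static int free_pages[NR_NODES];

/*
 * Walk the nodes in wrap-around order starting at 'start' (a crude
 * stand-in for that node's zonelist) and skip every node that is not
 * in 'mask'. A NULL mask means "no restriction".
 * Returns the node the page came from, or -1 if nothing was found.
 */
static int toy_alloc_nodemask(int start, const toy_nodemask *mask)
{
	assert(start >= 0 && start < NR_NODES);

	for (int i = 0; i < NR_NODES; i++) {
		int nid = (start + i) % NR_NODES;

		if (mask && !(*mask & (1u << nid)))
			continue;	/* filtered out by the nodemask */
		if (free_pages[nid] > 0) {
			free_pages[nid]--;
			return nid;
		}
	}
	return -1;
}

/*
 * Two-pass allocation: try the preferred nodes only, and if that
 * fails retry from the local node with no nodemask at all.
 */
static int toy_alloc_preferred_many(int local, const toy_nodemask *preferred)
{
	int nid = toy_alloc_nodemask(local, preferred);

	if (nid < 0)
		nid = toy_alloc_nodemask(local, NULL);
	return nid;
}

/* Four allocations from local node 0 with nodes 1 and 2 preferred. */
static void toy_demo(int out[4])
{
	toy_nodemask preferred = (1u << 1) | (1u << 2);

	free_pages[0] = 5;
	free_pages[1] = 1;
	free_pages[2] = 1;
	free_pages[3] = 5;

	for (int i = 0; i < 4; i++)
		out[i] = toy_alloc_preferred_many(0, &preferred);
}
```

In this model toy_demo() fills out[] with nodes 1, 2, 0, 0: the two
preferred nodes are drained first even though local node 0 has free
pages, and only then does the unrestricted second pass fall back to
node 0.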