On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> On 20-06-24 20:39:17, Michal Hocko wrote:
> > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > > On 20-06-24 09:52:16, Michal Hocko wrote:
> > > > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > > > [...]
> > > > > > It would be also great to provide a high level semantic description
> > > > > > here. I have very quickly glanced through patches and they are not
> > > > > > really trivial to follow with many incremental steps so the higher level
> > > > > > intention is lost easily.
> > > > > >
> > > > > > Do I get it right that the default semantic is essentially
> > > > > >     - allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > > >       semantic)
> > > > > >     - fallback to numa unrestricted allocation with the default
> > > > > >       numa policy on the failure
> > > > > >
> > > > > > Or are there any usecases to modify how hard to keep the preference over
> > > > > > the fallback?
> > > > >
> > > > > tl;dr is: yes, and no usecases.
> > > >
> > > > OK, then I am wondering why the change has to be so involved.
> > > > Except for syscall plumbing the only real change to the allocator path would be
> > > > something like
> > > >
> > > > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > > {
> > > >     /* Lower zones don't get a nodemask applied for MPOL_BIND */
> > > >     if (unlikely(policy->mode == MPOL_BIND ||
> > > >                  policy->mode == MPOL_PREFERED_MANY) &&
> > > >             apply_policy_zone(policy, gfp_zone(gfp)) &&
> > > >             cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > > >         return &policy->v.nodes;
> > > >
> > > >     return NULL;
> > > > }
> > > >
> > > > alloc_pages_current
> > > >
> > > >     if (pol->mode == MPOL_INTERLEAVE)
> > > >         page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > > >     else {
> > > >         gfp_t gfp_attempt = gfp;
> > > >
> > > >         /*
> > > >          * Make sure the first allocation attempt will try hard
> > > >          * but eventually fail without OOM killer or other
> > > >          * disruption before falling back to the full nodemask
> > > >          */
> > > >         if (pol->mode == MPOL_PREFERED_MANY)
> > > >             gfp_attempt |= __GFP_RETRY_MAYFAIL;
> > > >
> > > >         page = __alloc_pages_nodemask(gfp_attempt, order,
> > > >                 policy_node(gfp, pol, numa_node_id()),
> > > >                 policy_nodemask(gfp, pol));
> > > >         if (!page && pol->mode == MPOL_PREFERED_MANY)
> > > >             page = __alloc_pages_nodemask(gfp, order,
> > > >                 numa_node_id(), NULL);
> > > >     }
> > > >
> > > >     return page;
> > > >
> > > > similar (well slightly more hairy) in alloc_pages_vma
> > > >
> > > > Or do I miss something that really requires more involved approach like
> > > > building custom zonelists and other larger changes to the allocator?
> > >
> > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > this case when you try to get the page from the freelist, you'll get the
> > > zonelist of the preferred node, and when you actually scan through on page
> > > allocation, you have no way to filter out the non-preferred nodes.
> > > I think the plumbing of multiple nodes has to go all the way through
> > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> >
> > policy_nodemask() will provide the nodemask which will be used as a
> > filter on the policy_node.
>
> Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> to that point. UAPI cannot get independent masks, and callers of these functions
> don't yet use them.
>
> So let me ask before I actually type it up and find it's much much simpler, is
> there not some perceived benefit to having both masks being independent?

I am not sure I follow. Which two masks do you have in mind? zonelist and
user provided nodemask?

-- 
Michal Hocko
SUSE Labs
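
The two-step scheme discussed above (try the preferred nodemask first, then
retry with no nodemask) can be modeled in userspace as a minimal sketch. This
is not kernel code: `toy_numa`, `toy_alloc_nodemask`,
`toy_alloc_preferred_many` and `MAX_NODES` are hypothetical names made up for
illustration. The model also simplifies two things: it scans nodes in index
order starting from the local node instead of walking a distance-ordered
zonelist, and it does not model the reclaim effort implied by
__GFP_RETRY_MAYFAIL.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8 /* hypothetical toy limit, not a kernel constant */

/* Hypothetical toy model: number of free pages on each NUMA node. */
struct toy_numa {
    int free_pages[MAX_NODES];
};

/*
 * Take one page from the first node, scanning from `start` and wrapping,
 * that is permitted by `mask`. A NULL mask means "no restriction", which
 * mirrors passing a NULL nodemask in the quoted snippet.
 * Returns the node id used, or -1 if no permitted node has a free page.
 */
static int toy_alloc_nodemask(struct toy_numa *n, int start,
                              const unsigned *mask)
{
    assert(start >= 0 && start < MAX_NODES);

    for (int i = 0; i < MAX_NODES; i++) {
        int node = (start + i) % MAX_NODES;

        /* The mask acts purely as a filter on the scan order. */
        if (mask && !(*mask & (1u << node)))
            continue;
        if (n->free_pages[node] > 0) {
            n->free_pages[node]--;
            return node;
        }
    }
    return -1;
}

/*
 * First attempt is restricted to the preferred nodes; on failure,
 * fall back to an unrestricted attempt starting at the local node.
 */
static int toy_alloc_preferred_many(struct toy_numa *n, int local,
                                    const unsigned *preferred)
{
    int node = toy_alloc_nodemask(n, local, preferred);

    if (node < 0)
        node = toy_alloc_nodemask(n, local, NULL);
    return node;
}
```

With free pages only on nodes 0 and 2 and a preferred mask of nodes 2 and 3,
the first call from local node 0 is served from node 2 even though node 0 is
closer in scan order. Once the preferred nodes are empty, the next call falls
back to node 0, and a call made when every node is empty returns -1.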