On Fri 26-06-20 14:39:05, Ben Widawsky wrote:
> On 20-06-24 09:52:16, Michal Hocko wrote:
> > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > [...]
> > > > It would be also great to provide a high level semantic description
> > > > here. I have very quickly glanced through patches and they are not
> > > > really trivial to follow with many incremental steps so the higher level
> > > > intention is lost easily.
> > > >
> > > > Do I get it right that the default semantic is essentially
> > > >     - allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > >       semantic)
> > > >     - fallback to numa unrestricted allocation with the default
> > > >       numa policy on the failure
> > > >
> > > > Or are there any usecases to modify how hard to keep the preference over
> > > > the fallback?
> > >
> > > tl;dr is: yes, and no usecases.
> >
> > OK, then I am wondering why the change has to be so involved. Except for
> > syscall plumbing the only real change to the allocator path would be
> > something like
> >
> > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > {
> >     /* Lower zones don't get a nodemask applied for MPOL_BIND */
> >     if (unlikely(policy->mode == MPOL_BIND ||
> >                  policy->mode == MPOL_PREFERED_MANY) &&
> >             apply_policy_zone(policy, gfp_zone(gfp)) &&
> >             cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> >         return &policy->v.nodes;
> >
> >     return NULL;
> > }
> >
> > alloc_pages_current
> >
> >     if (pol->mode == MPOL_INTERLEAVE)
> >         page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> >     else {
> >         gfp_t gfp_attempt = gfp;
> >
> >         /*
> >          * Make sure the first allocation attempt will try hard
> >          * but eventually fail without OOM killer or other
> >          * disruption before falling back to the full nodemask
> >          */
> >         if (pol->mode == MPOL_PREFERED_MANY)
> >             gfp_attempt |= __GFP_RETRY_MAYFAIL;
> >
> >         page = __alloc_pages_nodemask(gfp_attempt, order,
> >                 policy_node(gfp, pol, numa_node_id()),
> >                 policy_nodemask(gfp, pol));
> >         if (!page && pol->mode == MPOL_PREFERED_MANY)
> >             page = __alloc_pages_nodemask(gfp, order,
> >                 numa_node_id(), NULL);
> >     }
> >
> >     return page;
> >
> > similar (well slightly more hairy) in alloc_pages_vma
> >
> > Or do I miss something that really requires more involved approach like
> > building custom zonelists and other larger changes to the allocator?
>
> Hi Michal,
>
> I'm mostly done implementing this change. It looks good, and so far I think it's
> functionally equivalent. One thing though, above you use NULL for the fallback.
> That actually should not be NULL because of the logic in policy_node to restrict
> zones, and obey cpusets. I've implemented it as such, but I was hoping someone
> with a deeper understanding, and more experience can confirm that was the
> correct thing to do.

Cpusets are just plumbed into the allocator directly. Have a look at
__cpuset_zone_allowed call inside get_page_from_freelist. Anyway
functionally what you are looking for here is that the fallback
allocation should be exactly as if there was no mempolicy in place. And
that is expressed by NULL nodemask. The rest is done automagically...

-- 
Michal Hocko
SUSE Labs
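
Illustration (not part of the original message): the two-step semantic discussed in this thread can be modelled in user space. What follows is only a toy sketch, not kernel code. The names toy_alloc_nodemask, toy_alloc_preferred_many and TOY_NR_NODES are hypothetical, a node is reduced to a counter of free pages, and cpusets, zones and GFP flags are not modelled at all. The only point it tries to show is that the fallback passes a NULL nodemask, which here means "no restriction".

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_NODES 4  /* hypothetical node count for the model */

/*
 * Hypothetical stand-in for the page allocator. Walks the nodes starting
 * at start_node and takes one page from the first node that is allowed by
 * the nodemask and still has free pages. A NULL nodemask means that every
 * node is allowed. Returns the node used, or -1 on failure.
 */
static int toy_alloc_nodemask(int *free_pages, int start_node,
                              const unsigned long *nodemask)
{
    assert(free_pages != NULL);
    assert(start_node >= 0 && start_node < TOY_NR_NODES);

    for (int i = 0; i < TOY_NR_NODES; i++) {
        int node = (start_node + i) % TOY_NR_NODES;

        /* Skip nodes outside the mask when a mask is given. */
        if (nodemask && !(*nodemask & (1UL << node)))
            continue;

        if (free_pages[node] > 0) {
            free_pages[node]--;
            return node;
        }
    }
    return -1;
}

/*
 * Toy model of the "preferred many" semantic: first try only the
 * preferred nodes, then retry as if no policy were in place by passing
 * a NULL nodemask.
 */
static int toy_alloc_preferred_many(int *free_pages, int local_node,
                                    const unsigned long *preferred)
{
    int node = toy_alloc_nodemask(free_pages, local_node, preferred);

    if (node < 0)
        node = toy_alloc_nodemask(free_pages, local_node, NULL);

    return node;
}
```

In this model, with free pages only on nodes 0 and 2 and a preferred mask covering nodes 2 and 3, the first request is served from node 2, the second falls back to node 0, and a third fails because nothing is left anywhere.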