Re: [PATCH 00/18] multiple preferred nodes

On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> On 20-06-23 13:20:48, Michal Hocko wrote:
[...]
> > It would be also great to provide a high level semantic description
> > here. I have very quickly glanced through patches and they are not
> > really trivial to follow with many incremental steps so the higher level
> > intention is lost easily.
> > 
> > Do I get it right that the default semantic is essentially
> > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > 	  semantic)
> > 	- fallback to numa unrestricted allocation with the default
> > 	  numa policy on the failure
> > 
> > Or are there any usecases to modify how hard to keep the preference over
> > the fallback?
> 
> tl;dr is: yes, and no usecases.

OK, then I am wondering why the change has to be so involved. Except for
the syscall plumbing, the only real change to the allocator path would be
something like

static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
	/*
	 * Lower zones don't get a nodemask applied for MPOL_BIND or
	 * MPOL_PREFERRED_MANY
	 */
	if (unlikely(policy->mode == MPOL_BIND ||
		     policy->mode == MPOL_PREFERRED_MANY) &&
			apply_policy_zone(policy, gfp_zone(gfp)) &&
			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
		return &policy->v.nodes;

	return NULL;
}

and in alloc_pages_current:

	if (pol->mode == MPOL_INTERLEAVE)
		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
	else {
		gfp_t gfp_attempt = gfp;

		/*
		 * Make sure the first allocation attempt will try hard
		 * but eventually fail without OOM killer or other
		 * disruption before falling back to the full nodemask
		 */
		if (pol->mode == MPOL_PREFERRED_MANY)
			gfp_attempt |= __GFP_RETRY_MAYFAIL;

		page = __alloc_pages_nodemask(gfp_attempt, order,
				policy_node(gfp, pol, numa_node_id()),
				policy_nodemask(gfp, pol));
		if (!page && pol->mode == MPOL_PREFERRED_MANY)
			page = __alloc_pages_nodemask(gfp, order,
				numa_node_id(), NULL);
	}

	return page;

and similarly (well, slightly more hairy) in alloc_pages_vma.

Or am I missing something that really requires a more involved approach,
like building custom zonelists and other larger changes to the allocator?
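
To make the intended semantic concrete, here is a minimal userspace sketch
of the two-step behavior: first try only the preferred nodes, then retry
with no node restriction. It is a toy model and not kernel code. Every
toy_* name is hypothetical, per-node free page counters stand in for the
real allocator, and the "try hard but fail without the OOM killer" part of
__GFP_RETRY_MAYFAIL is not modelled at all; a failed first attempt simply
returns -1.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 8

/* Hypothetical stand-in for the allocator state: free pages per node. */
struct toy_machine {
	unsigned int free_pages[TOY_MAX_NODES];
};

/* Bit n set means node n may be used. */
typedef unsigned int toy_nodemask_t;

/*
 * Take one page from the first allowed node that has free pages,
 * scanning from start_nid and wrapping around. A NULL mask means
 * "no restriction", mirroring the NULL nodemask in the fallback call.
 * Returns the node id used, or -1 if no allowed node has a free page.
 */
static int toy_alloc_nodemask(struct toy_machine *m, int start_nid,
			      const toy_nodemask_t *mask)
{
	assert(start_nid >= 0 && start_nid < TOY_MAX_NODES);

	for (int i = 0; i < TOY_MAX_NODES; i++) {
		int nid = (start_nid + i) % TOY_MAX_NODES;

		if (mask && !(*mask & (1u << nid)))
			continue;
		if (m->free_pages[nid] > 0) {
			m->free_pages[nid]--;
			return nid;
		}
	}
	return -1;
}

/*
 * Preferred-many semantic as described above: the first attempt is
 * restricted to the preferred nodes; only if that fails do we fall
 * back to an unrestricted attempt starting at the local node.
 */
static int toy_alloc_preferred_many(struct toy_machine *m, int local_nid,
				    toy_nodemask_t preferred)
{
	int nid = toy_alloc_nodemask(m, local_nid, &preferred);

	if (nid < 0)
		nid = toy_alloc_nodemask(m, local_nid, NULL);
	return nid;
}
```

With free pages only on nodes 0, 2 and 3 and a preferred mask of nodes 1
and 2, the first allocation is served from node 2. Once node 2 is drained,
later allocations fall back to whichever node still has memory, and the
call fails only when every node is empty.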
-- 
Michal Hocko
SUSE Labs