On Wed, 7 Oct 2009, Lee Schermerhorn wrote:

> > > Index: linux-2.6.31-mmotm-090925-1435/mm/hugetlb.c
> > > ===================================================================
> > > --- linux-2.6.31-mmotm-090925-1435.orig/mm/hugetlb.c	2009-09-30 12:48:45.000000000 -0400
> > > +++ linux-2.6.31-mmotm-090925-1435/mm/hugetlb.c	2009-10-02 21:22:04.000000000 -0400
> > > @@ -1334,29 +1334,71 @@ static struct hstate *kobj_to_hstate(str
> > >  	return NULL;
> > >  }
> > > 
> > > -static ssize_t nr_hugepages_show(struct kobject *kobj,
> > > +static ssize_t nr_hugepages_show_common(struct kobject *kobj,
> > >  					struct kobj_attribute *attr, char *buf)
> > >  {
> > >  	struct hstate *h = kobj_to_hstate(kobj);
> > >  	return sprintf(buf, "%lu\n", h->nr_huge_pages);
> > >  }
> > > -static ssize_t nr_hugepages_store(struct kobject *kobj,
> > > -		struct kobj_attribute *attr, const char *buf, size_t count)
> > > +static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
> > > +			struct kobject *kobj, struct kobj_attribute *attr,
> > > +			const char *buf, size_t len)
> > >  {
> > >  	int err;
> > > -	unsigned long input;
> > > +	unsigned long count;
> > >  	struct hstate *h = kobj_to_hstate(kobj);
> > > +	NODEMASK_ALLOC(nodemask_t, nodes_allowed);
> > > 
> > 
> > In the two places you do NODEMASK_ALLOC(), here and
> > hugetlb_sysctl_handler(), you'll need to check that nodes_allowed is
> > non-NULL since it's possible that kmalloc() will return NULL for
> > CONFIG_NODES_SHIFT > 8.
> > 
> > In such a case, it's probably sufficient to simply set nodes_allowed to
> > node_states[N_HIGH_MEMORY] so that we can still free hugepages when we're
> > oom, a common memory freeing tactic.
> > 
> > You could do that by simply returning false from
> > init_nodemask_of_mempolicy() if !nodes_allowed since NODEMASK_FREE() can
> > take a NULL pointer, but it may be easier to factor that logic into your
> > conditional below:
> > 
> > > -	err = strict_strtoul(buf, 10, &input);
> > > +	err = strict_strtoul(buf, 10, &count);
> > >  	if (err)
> > >  		return 0;
> > > 
> > > -	h->max_huge_pages = set_max_huge_pages(h, input, &node_online_map);
> > > +	if (!(obey_mempolicy && init_nodemask_of_mempolicy(nodes_allowed))) {
> > > +		NODEMASK_FREE(nodes_allowed);
> > > +		nodes_allowed = &node_online_map;
> > > +	}
> > > +	h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);
> > > 
> > 
> > You can get away with just testing !nodes_allowed here since the stack
> > allocation variation of NODEMASK_ALLOC() is such that nodes_allowed will
> > always be an initialized pointer pointing to _nodes_allowed so you won't
> > have an uninitialized warning.
> > 
> > Once that's done, you can get rid of the check for a NULL nodes_allowed in
> > try_to_free_low() from patch 2 since it will always be valid in
> > set_max_huge_pages().
> 
> OK.  already removed the NULL check from try_to_free_low().  And I made
> the change to init_nodemask_of_mempolicy to return false on NULL mask.
> 
> I'm not completely happy with dropping back to default behavior
> [node_online_map here, replaced with node_states[N_HIGH_MEMORY] in
> subsequent patch] on failure to allocate nodes_allowed.  We only do the
> NODEMASK_ALLOC when we've come in from either nr_hugepages_mempolicy or
> a per node attribute [subsequent patch], so I'm not sure that ignoring
> the mempolicy, if any, or the specified node id, is a good thing here.
> Not silently, at least.  I haven't addressed this, yet.  We can submit
> an incremental patch.  Thoughts?
> 
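
Just so we're looking at the same failure mode: with
init_nodemask_of_mempolicy() returning false on a NULL mask, a failed
NODEMASK_ALLOC() simply folds into your existing fallback, roughly like
this (a sketch using the identifiers from the hunk above, not new code):

	NODEMASK_ALLOC(nodemask_t, nodes_allowed);	/* may be NULL for CONFIG_NODES_SHIFT > 8 */
	...
	if (!(obey_mempolicy && init_nodemask_of_mempolicy(nodes_allowed))) {
		/* also reached when nodes_allowed == NULL */
		NODEMASK_FREE(nodes_allowed);		/* ok to pass NULL */
		nodes_allowed = &node_online_map;	/* mempolicy or node id silently ignored */
	}
	h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);

so the write still succeeds, it just targets the default online map
rather than the nodes the admin asked for.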

Hmm, it's debatable since the NODEMASK_ALLOC() slab allocation is
GFP_KERNEL which would cause direct reclaim (and perhaps even the oom
killer) to free memory.  If the oom killer were invoked, current would
probably even be killed because of how the oom killer works for
CONSTRAINT_MEMORY_POLICY.  So the end result is that the pages would
eventually be freed because current would get access to memory reserves
via TIF_MEMDIE but would die immediately after returning.  It was nice
of current to sacrifice itself like that.

Unfortunately, I think the long term solution is that NODEMASK_ALLOC()
is going to require a gfp parameter to pass to kmalloc() and in this
case we should union __GFP_NORETRY.  Then, if nodes_allowed can't be
allocated I think it would be better to simply return -ENOMEM to
userspace so it can either reduce the number of global hugepages or
free memory in another way.  (There might be a caveat where the user's
mempolicy already includes all online nodes and they use
nr_hugepages_mempolicy where they couldn't free hugepages because of
-ENOMEM but could via nr_hugepages, but I don't think you need to
address that.)

The worst case allocation is probably 512 bytes for CONFIG_NODES_SHIFT
of 12 so I don't think using __GFP_NORETRY here is going to be that
ridiculous.
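
Concretely, something along these lines is what I'm picturing (a sketch
only: the gfp argument to NODEMASK_ALLOC() doesn't exist yet and the
-ENOMEM handling is just illustrative):

	/* nodemask.h: kmalloc variant of NODEMASK_ALLOC() grows a gfp argument;
	 * the on-stack variant for small CONFIG_NODES_SHIFT would ignore it */
	#define NODEMASK_ALLOC(type, name, gfp_flags)	\
			type *name = kmalloc(sizeof(*name), gfp_flags)

	/* nr_hugepages_store_common(): don't loop forever in the allocator and
	 * don't silently ignore the mempolicy when the allocation fails */
	NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
	if (!nodes_allowed) {
		if (obey_mempolicy)
			return -ENOMEM;
		nodes_allowed = &node_online_map;	/* plain nr_hugepages still works */
	}

That keeps the plain nr_hugepages path working when we're under memory
pressure while letting nr_hugepages_mempolicy punt the decision back to
userspace instead of quietly widening the mask.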