On Mon, 15 Nov 2010, Ricardo M. Correia wrote:

> When __vmalloc() / __vmalloc_area_node() calls map_vm_area(), the latter
> can allocate pages with GFP_KERNEL despite the caller of __vmalloc()
> having requested a stricter gfp mask.
>
> We fix this by introducing a per-thread gfp_mask, similar to
> gfp_allowed_mask but which only applies to the current thread.
> __vmalloc_area_node() will now temporarily restrict the per-thread
> gfp_mask when it calls map_vm_area().
>
> This new per-thread gfp mask may also be used for other useful purposes,
> for example, after thread creation, to make sure that certain threads
> (e.g. filesystem I/O threads) never allocate memory with certain flags
> (e.g. __GFP_FS or __GFP_IO).

I dislike this approach, not only for its performance degradation in core
areas like the page and slab allocators, but also because it requires full
knowledge of the callchain to determine the gfp flags of an allocation.
This will become nasty very quickly. The proposal essentially defines an
entirely new method for passing gfp flags to the page allocator when it
isn't strictly needed.

I think the problem you're addressing can be solved in one of two ways:

 - create lower-level functions in each arch that pass a gfp argument to
   the allocator rather than hard-coded GFP_KERNEL, or

 - avoid doing anything other than GFP_KERNEL allocations for
   __vmalloc(): the only current users are gfs2, ntfs, and ceph (the
   page allocator's own use of __vmalloc() can be discounted since it's
   done at boot, and GFP_ATOMIC there has almost no chance of failing
   since the size is determined based on what is available).
The first option really addresses the bug that you're running into and can
be done in a relatively simple way by redefining the current users of
pmd_alloc_one(), for instance, in terms of a new lower-level
__pmd_alloc_one():

	static inline pmd_t *__pmd_alloc_one(struct mm_struct *mm,
					     unsigned long addr, gfp_t flags)
	{
		return (pmd_t *)get_zeroed_page(flags);
	}

	static inline pmd_t *pmd_alloc_one(struct mm_struct *mm,
					   unsigned long addr)
	{
		return __pmd_alloc_one(mm, addr, GFP_KERNEL | __GFP_REPEAT);
	}

and then using __pmd_alloc_one() in the vmalloc path with the passed mask
rather than pmd_alloc_one(). This _will_ be slightly intrusive because it
requires fixing up some short callchains to pass the appropriate mask, but
that will be limited to the vmalloc code and the arch code that currently
does unconditional GFP_KERNEL allocations. Both are bugs that you'll be
addressing for each architecture, so the intrusiveness of the change has
merit (and be sure to cc linux-arch@xxxxxxxxxxxxxxx on it as well).

I only mention the second option because passing GFP_NOFS to __vmalloc()
for sufficiently large sizes has a much higher probability of failing if
you're running into issues where GFP_KERNEL is causing synchronous reclaim.
We may not be able to do any better in the contexts in which gfs2, ntfs,
and ceph use it without some sort of preallocation at an earlier time, but
the likelihood of those allocations failing is much higher than for the
typical vmalloc() that tries really hard with __GFP_REPEAT to allocate the
memory required.