On Mon, 15 Nov 2010, Ricardo M. Correia wrote:

> When __vmalloc() / __vmalloc_area_node() calls map_vm_area(), the latter
> can allocate pages with GFP_KERNEL despite the caller of __vmalloc()
> having requested a stricter gfp mask.
>
> We fix this by introducing a per-thread gfp_mask, similar to
> gfp_allowed_mask but which only applies to the current thread.
> __vmalloc_area_node() will now temporarily restrict the per-thread
> gfp_mask when it calls map_vm_area().
>
> This new per-thread gfp mask may also be used for other useful purposes,
> for example, after thread creation, to make sure that certain threads
> (e.g. filesystem I/O threads) never allocate memory with certain flags
> (e.g. __GFP_FS or __GFP_IO).

I dislike this approach, not only for its performance degradation in core
areas like the page and slab allocators, but also because it requires full
knowledge of the callchain to determine the gfp flags of an allocation.
This will become nasty very quickly. The proposal essentially defines an
entirely new method for passing gfp flags to the page allocator when it
isn't strictly needed.

I think the problem you're addressing can be solved in one of two ways:

 - create lower-level functions in each arch that pass a gfp argument to
   the allocator rather than hard-coded GFP_KERNEL, or

 - avoid doing anything other than GFP_KERNEL allocations for
   __vmalloc(): the only current users are gfs2, ntfs, and ceph (the
   page allocator's own use of __vmalloc() can be discounted since it's
   done at boot, and GFP_ATOMIC there has almost no chance of failing
   since the size is determined based on what is available).
The first option really addresses the bug that you're running into and can
be done in a relatively simple way by redefining the current users of
pmd_alloc_one(), for instance, in terms of a new lower-level
__pmd_alloc_one():

	static inline pmd_t *__pmd_alloc_one(struct mm_struct *mm,
					     unsigned long addr, gfp_t flags)
	{
		return (pmd_t *)get_zeroed_page(flags);
	}

	static inline pmd_t *pmd_alloc_one(struct mm_struct *mm,
					   unsigned long addr)
	{
		return __pmd_alloc_one(mm, addr, GFP_KERNEL | __GFP_REPEAT);
	}

and then using __pmd_alloc_one() in the vmalloc path with the passed mask
rather than pmd_alloc_one(). This _will_ be slightly intrusive because it
requires fixing up some short callchains to pass the appropriate mask, but
that will be limited to the vmalloc code and the arch code that currently
does unconditional GFP_KERNEL allocations. Both are bugs that you'll be
addressing for each architecture, so the intrusiveness of the change has
merit (and be sure to cc linux-arch@xxxxxxxxxxxxxxx on it as well).

I only mention the second option because passing GFP_NOFS to __vmalloc()
for sufficiently large sizes has a much higher probability of failing if
you're running into issues where GFP_KERNEL is causing synchronous reclaim.
We may not be able to do any better in the contexts in which gfs2, ntfs,
and ceph use it without some sort of preallocation at an earlier time, but
the likelihood of those allocations failing is much higher than for the
typical vmalloc() that tries really hard with __GFP_REPEAT to allocate the
memory required.