On Mon, 15 Nov 2010, Ricardo M. Correia wrote: > > This proposal essentially defines an entirely new method for passing gfp > > flags to the page allocator when it isn't strictly needed. > > I don't see it as a way of passing gfp flags to the page allocator, but > rather as a way of restricting which gfp flags can be used (on a > per-thread basis). I agree it's not strictly needed. > It restricts the flags that can be used in the current context, but that's the whole purpose of gfp arguments such as __GFP_FS, __GFP_IO, and __GFP_WAIT that control the reclaim behavior of the page allocator in the first place: we want to use all strategies at our disposal that are allowed in the current context to allocate the memory. Using this interface to restrict only to __GFP_WAIT, for example, is the same as passing __GFP_WAIT. > > The first option really addresses the bug that you're running into and can > > be addressed in a relatively simple way by redefining current users of > > pmd_alloc_one(), for instance, as a form of a new lower-level > > __pmd_alloc_one(): > > > > static inline pmd_t *__pmd_alloc_one(struct mm_struct *mm, > > unsigned long addr, gfp_t flags) > > { > > return (pmd_t *)get_zeroed_page(flags); > > } > > > > static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) > > { > > return __pmd_alloc_one(GFP_KERNEL|__GFP_REPEAT); > > } > > > > and then using __pmd_alloc_one() in the vmalloc path with the passed mask > > rather than pmd_alloc_one(). > > Ok, this sounds OK in theory (I'd have to change all these > implementations for all architectures, but oh well..). But I have a > question about your proposal.. the call chain seems to go: > Yes, all architectures would need to be changed because they all unconditionally do GFP_KERNEL allocations which is buggy in the vmalloc case, so the breadth of the change is certainly warranted. > vmap_pmd_range() > pmd_alloc() > __pmd_alloc() > pmd_alloc_one() > > So you want to change this so that vmap_pmd_range() calls your new > __pmd_alloc_one() instead of pmd_alloc_one(). > Yes, with the gfp flags that were passed to __vmalloc() to restrict the lower-level allocations that you cited in your original email (pmd_alloc_one() and pte_alloc_one_kernel()). > But this means that pmd_alloc() and __pmd_alloc() would need to accept > the flag as well, right? > > Probably we should define an alias for pmd_alloc() so that we don't have > to change all its callers to pass a flag. But __pmd_alloc() is already > taken, so what would you suggest we call it? _pmd_alloc()? :-) > Yeah, that's why I said it's slightly more intrusive than the single example I gave, you need to modify the callchain to pass the flags in other functions as well. Instead of extending the __*() functions with more underscores like other places in the kernel (see mm/slab.c, for instance), I'd suggest just appending _gfp() to their name so __pmd_alloc() uses a new __pmd_alloc_gfp(). > Indeed... I don't think we can do any better in Lustre either. The > preallocation also doesn't guarantee anything for us, because the amount > of memory that we may need to allocate can be huge in the worst case, > and it would be prohibitive to preallocate it. > I think it really depends on how much memory you plan to allocate without allowing direct reclaim (and that also prevents the oom killer from being used as well). The ntfs use, for instance, falls back to vmalloc if the size is greater than PAGE_SIZE to avoid high-order allocations from failing due to fragmentation and memory compaction can't be used for GFP_NOFS either. The best the VM can do is allocate the order-0 pages with GFP_NOFS which has a much higher liklihood that it won't succeed, so it's really to gfs2, ntfs, and ceph's detriment that it doesn't have a workaround. > For our case, I'd think it's better to either handle failure or somehow > retry until the allocation succeeds (if we know for sure that it will, > eventually). > If your use-case is going to block until this memory is available, there's a serious problem that you'll need to address because nothing is going to guarantee that memory will be freed unless something else is trying to allocate memory and pages get written back or something gets killed as a result. Strictly relying on that behavior is concerning, but it's not something that can be fixed in the VM. > > but the liklihood of those allocations failing is much > > harder than the typical vmalloc() that tries really hard with __GFP_REPEAT > > to allocate the memory required. > > Not sure what do you mean by this.. I don't see a typical vmalloc() > using __GFP_REPEAT anywhere (apart from functions such as > pmd_alloc_one(), which in the code above you suggested to keep passing > __GFP_REPEAT).. am I missing something? > __GFP_REPEAT will retry the allocation indefinitely until the needed amount of memory is reclaimed without considering the order of the allocation; all orders of interest in your case are order-0, so it will loop indefinitely until a single page is reclaimed which won't happen with GFP_NOFS. Thus, passing the flag is the equivalent of asking the allocator to loop forever until memory is available rather than failing and returning to your error handling. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>