On Sat, Feb 28, 2015 at 11:03:59AM +1100, Dave Chinner wrote:
> > I think the best way is if slab could also learn to provide reserves for
> > individual objects. Either just mark internally how many of them are reserved,
> > if sufficient number is free, or translate this to the page allocator reserves,
> > as slab knows which order it uses for the given objects.
>
> Which is effectively what a slab based mempool is. Mempools don't
> guarantee a reserve is available once it's been resized, however,
> and we'd have to have mempools configured for every type of
> allocation we are going to do. So from that perspective it's not
> really a solution.

The bigger problem is that the upper layer which is making the
reservation before it starts taking locks won't necessarily know
exactly which slab objects it, and all of the lower layers, might
need.  So it's much more flexible, and requires less accuracy, if we
can just request that (a) the mm subsystem reserve at least N pages,
and (b) tell it that at this point in time, it's safe for the
requesting subsystem to block until N pages are available.

Can this be guaranteed to be accurate?  No, of course not.  And in
some cases it may not even be possible, since it might depend on
whether the iSCSI device needs to reconnect to the target, or do some
sort of exception handling, before it can complete its I/O request.
But it's better than what we have now, which is that once we've taken
certain locks and/or started a complex transaction, we can't really
back out, so we end up looping, either by using GFP_NOFAIL or by
retrying the memory allocation request ourselves, if there are still
mm developers who are delusional enough to proclaim, a la King
Canute, "You must always be able to handle memory allocation failure
at any point in the kernel, and GFP_NOFAIL is an indication of a
subsystem bug!"

I can imagine using some adjustment factors, where a particularly
voracious device might require a hint to the file system to boost its
memory allocation estimate by 30% or 50%.  So yes, it's a very,
*very* rough estimate.  And if we guess wrong, we might end up having
to loop a la GFP_NOFAIL anyway.  But it's better than not having such
an estimate.

I also grant that this doesn't work very well for emergency writeback
or background writeback, where we can't and shouldn't block waiting
for enough memory to become free, since page cleaning is one of the
ways we might be able to make memory available.  But if that's the
only problem we have, we're in good shape, since that can be solved
by either (a) doing a better job throttling memory allocations or
memory reservation requests in the first place, and/or (b) starting
background writeback earlier and much more aggressively.

					- Ted
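
P.S.  To make the mempool comparison concrete, this is roughly what a
slab-backed per-object-type reserve looks like with today's mempool
API.  The object type and names below are made up for illustration;
only the mempool/slab calls themselves are the real interface:

#include <linux/slab.h>
#include <linux/mempool.h>
#include <linux/module.h>

/* Hypothetical object type; every such type needs its own pool. */
struct foo {
	unsigned long	id;
	void		*buf;
};

static struct kmem_cache *foo_cache;
static mempool_t *foo_pool;

static int __init foo_init(void)
{
	foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0, 0, NULL);
	if (!foo_cache)
		return -ENOMEM;

	/* Pre-allocate and keep at least 16 objects in reserve. */
	foo_pool = mempool_create_slab_pool(16, foo_cache);
	if (!foo_pool) {
		kmem_cache_destroy(foo_cache);
		return -ENOMEM;
	}
	return 0;
}

	/*
	 * Later, on the I/O path, mempool_alloc() falls back to the
	 * reserve when the slab allocator can't make progress:
	 *
	 *	obj = mempool_alloc(foo_pool, GFP_NOIO);
	 *	...
	 *	mempool_free(obj, foo_pool);
	 */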
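
And a very rough, purely hypothetical sketch of the page-based
reservation interface I'm describing above.  None of these functions
exist in the kernel today, and all the names are invented:

/* Hypothetical -- an opaque handle for a page reservation. */
struct mem_reserve;

/*
 * Called before any locks are taken or a transaction is started, so
 * it is safe to block here until the mm can set aside @nr_pages.
 */
struct mem_reserve *mem_reserve_pages(unsigned long nr_pages);

/*
 * Allocations issued between begin/end may dip into the reserve
 * instead of looping with __GFP_NOFAIL once we are committed.
 */
void mem_reserve_begin(struct mem_reserve *res);
void mem_reserve_end(struct mem_reserve *res);

/*
 * Rough usage in a file system path, with a deliberately padded
 * estimate (fs_estimate_pages() and struct fs_handle are made up):
 */
static int fs_do_transaction(struct fs_handle *handle)
{
	struct mem_reserve *res;

	/* Boost the estimate by 30% for a voracious device. */
	res = mem_reserve_pages(fs_estimate_pages(handle) * 13 / 10);
	if (!res)
		return -ENOMEM;

	mem_reserve_begin(res);
	/* ... take locks, start the transaction, do the work ... */
	mem_reserve_end(res);
	return 0;
}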