On Sun, Mar 01, 2015 at 08:43:22AM -0500, Theodore Ts'o wrote: > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote: > > Overestimating should be fine, the result would a bit of false memory > > pressure. But underestimating and looping can't be an option or the > > original lockups will still be there. We need to guarantee forward > > progress or the problem is somewhat mitigated at best - only now with > > quite a bit more complexity in the allocator and the filesystems. > > We've lived with looping as it is and in practice it's actually worked > well. I can only speak for ext4, but I do a lot of testing under very > high memory pressure situations, and it is used in *production* under > very high stress situations --- and the only time we'e run into > trouble is when the looping behaviour somehow got accidentally > *removed*. > > There have been MM experts who have been worrying about this situation > for a very long time, but honestly, it seems to be much more of a > theoretical than actual concern. Well, looping is a valid thing to do in most situations because on a loaded system there is a decent chance that an unrelated thread will volunteer some unreclaimable memory, or exit altogether. Right now, we rely on this happening, and it works most of the time. Maybe all the time, depending on how your machine is used. But when it does't, machines do lock up in practice. We had these lockups in cgroups with just a handful of threads, which all got stuck in the allocator and there was nobody left to volunteer unreclaimable memory. When this was being addressed, we knew that the same can theoretically happen on the system-level but weren't aware of any reports. Well now, here we are. It's been argued in this thread that systems shouldn't be pushed to such extremes in real life and that we simply expect failure at some point. If that's the consensus, then yes, we can stop this and tell users that they should scale back. But I'm not convinced just yet that this is the best we can do. > So if you don't want to get hints/estimates about how much memory > the file system is about to use, when the file system is willing to > wait or even potentially return ENOMEM (although I suspect starting > to return ENOMEM where most user space application don't expect it > will cause more problems), I'm personally happy to just use > GFP_NOFAIL everywhere --- or to hard code my own infinite loops if > the MM developers want to take GFP_NOFAIL away. Because in my > experience, looping simply hasn't been as awful as some folks on > this thread have made it out to be. As I've said before, I'd be happy to get estimates from the filesystem so that we can adjust our reserves, instead of simply running against the wall at some point and hoping that the OOM killer heuristics will save the day. Until then, I'd much prefer __GFP_NOFAIL over open-coded loops. If the OOM killer is too aggressive, we can tone it down, but as it stands that mechanism is the last attempt at forward progress if looping doesn't work out. In addition, when we finally transition to private memory reserves, we can easily find the callsites that need to be annotated with __GFP_MAY_DIP_INTO_PRIVATE_RESERVES. > So if you don't like the complexity because the perfect is the enemy > of the good, we can just drop this and the file systems can simply > continue to loop around their memory allocation calls... or if that > fails we can start adding subsystem specific mempools, which would be > even more wasteful of memory and probably at least as complicated. It really depends on what the goal here is. You don't have to be perfectly accurate, but if you can give us a worst-case estimate we can actually guarantee forward progress and eliminate these lockups entirely, like in the block layer. Sure, there will be bugs and the estimates won't be right from the start, but we can converge towards the right answer. If the allocations which are allowed to dip into the reserves - the current nofail sites? - can be annotated with a gfp flag, we can easily verify the estimates by serving those sites exclusively from the private reserve pool and emit warnings when that runs dry. We wouldn't even have to stress the system for that. But there are legitimate concerns that this might never work. For example, the requirements could be so unpredictable, or assessing them with reasonable accuracy could be so expensive, that the margin of error would make the worst case estimate too big to be useful. Big enough that the reserves would harm well-behaved systems. And if useful worst-case estimates are unattainable, I don't think we need to bother with reserves. We can just stick with looping and OOM killing, that works most of the time, too. _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs