On Sun, Mar 01, 2015 at 08:43:22AM -0500, Theodore Ts'o wrote:
> On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> > Overestimating should be fine, the result would be a bit of false
> > memory pressure.  But underestimating and looping can't be an option
> > or the original lockups will still be there.  We need to guarantee
> > forward progress or the problem is somewhat mitigated at best - only
> > now with quite a bit more complexity in the allocator and the
> > filesystems.
>
> We've lived with looping as it is and in practice it's actually worked
> well.  I can only speak for ext4, but I do a lot of testing under very
> high memory pressure situations, and it is used in *production* under
> very high stress situations --- and the only time we've run into
> trouble is when the looping behaviour somehow got accidentally
> *removed*.

Memory is a finite resource, and there are (unlimited) consumers that do
not allow their share to be reclaimed/recycled.  Mainly this is the
kernel itself, but it also includes anon memory once swap space runs
out, as well as mlocked and dirty memory.

It's not a question of whether there exists a true point of OOM, where
not enough memory is recyclable to satisfy new allocations - that point
inevitably exists.  It's a policy question of how to inform userspace
once it is reached.

We agree that we can't unconditionally fail allocations, because we
might be in the middle of a transaction, where an allocation failure can
potentially corrupt user data.  However, endlessly looping for progress
that cannot happen at this point has the exact same effect: the
transaction won't finish.  Only now the machine locks up as well.

It's great that your setups never truly go out of memory, but that
doesn't mean it can't happen in practice.  One answer to users at this
point could certainly be "stay away from the true point of OOM, and if
you don't, that's your problem."  The issue I take with this answer is
that, for the sake of memory utilization, users do want to get fairly
close to this point, and at the same time it's hard to reliably predict
the memory consumption of a workload in advance.  It can depend on the
timing between threads, on user- or network-supplied input, or simply on
a bug in the application.  And if that OOM situation is entered by
accident, I'd prefer we had a better answer than locking up the machine
and blaming the user.

So one attempt to make progress in this situation is to kill userspace
applications that are pinning unreclaimable memory.  This is what we are
doing now, but there are several problems with it.  For one, we are
doing a terrible job and might still get stuck sometimes, which
deteriorates the situation back to failing the allocation and corrupting
the filesystem.  Secondly, killing tasks is disruptive, and because it's
driven by heuristics we're never going to kill the "right" one in all
situations.

Reserves would allow us to look ahead and avoid starting transactions
that cannot be finished given the available resources, so we at least
avoid filesystem corruption.  The tasks could probably be put to sleep
for some time in the hope that ongoing transactions complete and release
memory, but there might not be any, and eventually the OOM situation has
to be communicated to userspace.  Arguably, an -ENOMEM from a syscall at
this point might be easier to handle than a SIGKILL from the OOM killer
in an unrelated task.  So if we could pull off reserves, they look like
the most attractive solution to me.
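To make that a bit more concrete, here is a minimal userspace toy model
of the lookahead idea - not kernel code, and the reserve_pages() /
unreserve_pages() names are made up purely for illustration.  A
transaction states its worst-case page need up front and either gets a
guarantee or a clean -ENOMEM before it has dirtied anything:

/*
 * Toy model only: a global pool of "pages" and a reservation counter.
 * The point is the ordering - the -ENOMEM happens before the
 * transaction starts, never in the middle of it.
 */
#include <errno.h>
#include <stdio.h>

static long pool_pages = 128;	/* pretend this is all recyclable memory */
static long reserved_pages;	/* pages already promised to transactions */

static int reserve_pages(long nr)
{
	if (pool_pages - reserved_pages < nr)
		return -ENOMEM;	/* refuse before anything has been dirtied */
	reserved_pages += nr;
	return 0;
}

static void unreserve_pages(long nr)
{
	reserved_pages -= nr;
}

static int run_transaction(long worst_case)
{
	int err = reserve_pages(worst_case);

	if (err)
		return err;	/* clean failure, the transaction never started */

	/*
	 * Allocations up to worst_case are now guaranteed to succeed, so
	 * the transaction cannot get stuck waiting for memory that will
	 * never become available.
	 */

	unreserve_pages(worst_case);	/* commit finished, return the pages */
	return 0;
}

int main(void)
{
	printf("small transaction: %d\n", run_transaction(32));   /* 0 */
	printf("huge transaction:  %d\n", run_transaction(4096)); /* -ENOMEM */
	return 0;
}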
If we can't pull off reserves, the OOM killer needs to be fixed to
always make forward progress instead.  I proposed a patch for that
already.  But infinite loops that force the user to reboot the machine
at the point of OOM seem like a terrible policy.
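As a userspace-side illustration of the -ENOMEM-versus-SIGKILL point
above (mmap() is just a stand-in for whatever syscall would end up
reporting the failure; nothing here depends on a reserve scheme
existing): an application can at least react to an error return, while
SIGKILL cannot even be caught.

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	/*
	 * An absurdly large anonymous mapping is likely - though not
	 * guaranteed, depending on overcommit settings - to fail with
	 * ENOMEM.
	 */
	size_t len = (size_t)1 << 46;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		/*
		 * The application gets a chance to shed load: free
		 * caches, retry with a smaller request, or report the
		 * problem - none of which is possible when it simply
		 * receives SIGKILL.
		 */
		fprintf(stderr, "mmap: %s, backing off\n", strerror(errno));
		return 0;
	}

	munmap(p, len);
	return 0;
}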