On Fri 29-01-16 16:56:45, Andrea Arcangeli wrote:
> On Fri, Jan 29, 2016 at 03:38:06PM +0100, Michal Hocko wrote:
> > That would require the oom victim to release the memory and drop
> > TIF_MEMDIE before we go out_of_memory again. And that might happen
> > anytime whether we are holding oom_trylock or not because it doesn't
> > synchronize the exit path. So we are basically talking about:
> >
> >   should_alloc_retry
> >   [1]
> >   get_page_from_freelist(ALLOC_WMARK_HIGH)
> >   [2]
> >   out_of_memory
> >
> > and the race window for [1] is much smaller than [2] because [2] is quite
>
> [1] is before should_alloc_retry is set. It covers the entire reclaim.
>
> > costly operation. I wonder if this last moment request ever succeeds. I
> > have run my usual oom flood tests and it hasn't shown up a single time.
>
> For this check to make a difference, you need a lot of small programs
> all hitting OOM at the same time.

That is essentially what my oom flood testing program does: it spawns
hundreds of parallel anon mem eaters.

> Perhaps the trylock on the oom_lock tends to hide the race like you
> suggested earlier but it doesn't sound accurate if we proceed to oom
> kill without checking the high wmark at all before killing another
> task after a random reclaim failure.

The thing is that the reclaim would have to fail consistently after the
rework.

> Also note there's no CPU to save here, this is a very slow path,
> anything that can increase accuracy and avoid OOM kill false
> positives (at practically zero CPU cost like here) sounds worth it.

Sure, my idea wasn't to save CPU. The code was just a head scratcher and
an attempt at a clean up. We have more places where we keep heuristics
just because we have had them forever and it is hard to judge what
effect they have exactly.

> > That being said I do not care that much. I just find this confusing and
> > basically pointless because the whole thing is racy by definition and we
> > are trying to cover a smaller window. I would understand if we did such
> > a last attempt right before we are going to kill a selected victim. This
> > would cover a much larger race window.
>
> The high wmark itself is still an arbitrary value so yes, it isn't
> perfect, but the whole OOM killing is a heuristic, so a tiny race
> window to me sounds better than a huge race window.
>
> Moving this check down inside out_of_memory to reduce the window even
> further is quite a different proposition than removing the check.
>
> Currently we're doing this check after holding the oom_lock; back in
> 2.6.x it was much more racy, now thanks to the oom_lock it's way more
> reliable. If you want to increase reliability further I sure agree,
> but removing the check would drop reliability instead so I don't see
> how it could be preferable.
>
> We can increase reliability further if we'd move this high wmark check
> after select_bad_process() returned a task (and not -1UL), to be sure
> all TIF_MEMDIE tasks were already flushed out before checking the
> high wmark. It would just complicate the code and that's probably why
> it wasn't done.

This would be mixing two layers, which is not nice. Johannes was
proposing a different approach [1] which sounds much better to me.

[1] http://lkml.kernel.org/r/20160128235110.GA5805@xxxxxxxxxxx

--
Michal Hocko
SUSE Labs
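
For context, the check under discussion is the last-moment high watermark
allocation attempt performed under the oom_lock before a victim is selected.
The sketch below is only an illustration of that flow, not the verbatim
mm/page_alloc.c code of any particular kernel release; the function name
and the simplified oom_control setup are hypothetical.

```c
/*
 * Simplified sketch of the slow-path flow being debated: take the
 * oom_lock via trylock, make one last allocation attempt against the
 * high watermark to catch a parallel OOM kill that already released
 * memory, and only if that fails call out_of_memory() to pick a victim.
 */
static struct page *alloc_pages_may_oom_sketch(gfp_t gfp_mask,
					       unsigned int order,
					       const struct alloc_context *ac)
{
	struct page *page = NULL;
	struct oom_control oc = {
		.zonelist = ac->zonelist,	/* other fields elided */
		.gfp_mask = gfp_mask,
		.order	  = order,
	};

	/* Someone else is already OOM killing; back off and retry later. */
	if (!mutex_trylock(&oom_lock))
		return NULL;

	/*
	 * Last-moment check: a parallel OOM kill may have freed memory
	 * since direct reclaim failed, so retry at the high watermark
	 * before killing anything else.
	 */
	page = get_page_from_freelist(gfp_mask, order,
				      ALLOC_WMARK_HIGH | ALLOC_CPUSET, ac);
	if (page)
		goto out;

	/* Still nothing even at the high watermark: select and kill a victim. */
	out_of_memory(&oc);
out:
	mutex_unlock(&oom_lock);
	return page;
}
```

Andrea's proposal amounts to moving the get_page_from_freelist() retry from
before out_of_memory() to inside it, after select_bad_process() has picked a
victim, so that exiting TIF_MEMDIE tasks have had more time to release memory.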