On Tue, Dec 03, 2013 at 01:04:54PM +0100, Michal Hocko wrote: > On Mon 02-12-13 16:25:00, Johannes Weiner wrote: > > On Mon, Dec 02, 2013 at 09:02:21PM +0100, Michal Hocko wrote: > [...] > > > But we are not talking just about races here. What if the OOM is a > > > result of an OOM action itself. E.g. a killed task faults a memory in > > > while exiting and it hasn't freed its memory yet. Should we notify in > > > such a case? What would an userspace OOM handler do (the in-kernel > > > implementation has an advantage because it can check the tasks flags)? > > > > We don't notify in such a case. Every charge from a TIF_MEMDIE or > > exiting task is bypassing the limit immediately. Not even reclaim. > > Not really. Assume a memcg is under OOM. A task is killed by > userspace so we get into signal delivery code which clears > fatal_signal_pending and the code goes on to exit but then it faults in. > __mem_cgroup_try_charge will not see signal pending and TIF_MEMDIE is > not set yet. OOM is still not resolved so we are back to square one. Ah, that's a completely separate problem, though. One issue I have with these checks is that I never know which one are shortcuts and micro optimizations and which one are functionally necessary. > > > > So again, I don't see this patch is doing anything but blur the > > > > current line and make notification less predictable. And, as someone > > > > else in this thread already said, it's a uservisible change in > > > > behavior and would break known tuning usecases. > > > > > > I would like to understand how would such a tuning usecase work and how > > > it would break with this change. > > > > I would do test runs and with every run increase the size of the > > workload until I get OOM notifications to know when the kernel has > > been pushed beyond its limits and available memory + reclaim > > capability can't keep up with the workload anymore. > > > > Not informing me just because due to timing variance a random process > > exits in the last moment would be flat out lying. The machine is OOM. > > Many reclaim cycles failing is a good predictor. Last minute exit of > > random task is not, it's happenstance and I don't want to rely on a > > fluke like this to size my workload. > > Such a metric would be inherently racy for the same reason. You simply > cannot rely on not seeing OOMs because an exiting task managed to leave > in time (after MEM_CGROUP_RECLAIM_RETRIES direct reclaim loops and > before mem_cgroup_oom). Difference between in time and little bit too > late is just too fragile to be useful IMO. Are we saying the same thing? Or did I misunderstand you? > > > Consider the above example. You would get 2 notification for the very > > > same OOM condition. > > > On the other hand if the encountered exiting task was just a race then > > > we have two options basically. Either there are more tasks racing (and > > > not all of them are exiting) or there is only one (all are exiting). > > > We will not loose any notification in the first case because the flags > > > are checked before mem_cgroup_oom_trylock and so one of tasks would lock > > > and notify. > > > The second case is more interesting. Userspace won't get notification > > > but we also know that no action is required as the OOM will be resolved > > > by itself. And now we should consider whether notification would do more > > > good than harm. The tuning usecase would loose one event. Would such a > > > rare situation skew the statistics so much? On the other hand a real OOM > > > killer would do something which means something will be killed. I find > > > the later much worse. > > > > We already check in various places (sigh) for whether reclaim and > > killing is still necessary. What is the end game here? An endless > > loop right before the kill where we check if the kill is still > > necessary? > > The patch as is doesn't cover all the cases and ideally we should check > that for OOM_SCAN_ABORT and later in oom_kill_process because they can > back out as well if we want to have only-on-action notification. Such a > solution would be too messy though. There is never an only-on-action notification and there is no check to cover all cases. This is the fallacy I'm trying to point out. All we can do is pick a fairly predictable line and stick to it. > But as I've said. The primary reason I liked this change is because it > solves the above mentioned OOM during exit issue and it also prevents > from a pointless notification. I am perfectly fine with moving the > check+set TIF_MEMDIE down so solve only the issue #1 and do not mess > with notifications. The notification is not pointless, we are OOM at the time we make the decision. > > You're not fixing this problem, so why make the notifications less > > reliable? > > I am still not seeing why it is less reliable. The notification is > inherently racy so you cannot rely on any simple metrics based on their > count (at least not in general). It should be based on reclaim failing, which is more predictable and reproducible across multiple runs. Anything else is just random coincidence, I can't fathom why you think that it's at all meaningful. You could do one more reclaim cycle and a charge attempt, or even just add an msleep() between the last reclaim cycle and the last charge attempt for exactly the same outcome. > > > So all in all. I do agree with you that this path will never be race > > > free and without pointless OOM actions. I also agree that drawing the > > > line is hard. But I am more inclined to prevent from notification when > > > we already know that _no action_ is required because IMHO the vast > > > majority of oom listeners are there to _do_ an action which is mostly > > > deadly. > > > > If you want to push the machine so hard that active measures like > > reclaim can't keep up and you rely on stupid timing like this to save > > your sorry butt, then you'll just have to live with the > > unpredictability of it. You're going to eat kills that might have > > been avoided last minute either way. It's no excuse to plaster the MM > > with TIF_MEMDIE checks and last-minute cgroup margin checks in the > > weirdest locations. > > Yes I do not agree with putting TIF_MEMDIE checks all over the place and > we should reduce their number to minimum. It is fair to say that the > patch didn't add a new check. It just has moved it to cover both > in-kernel and user space oom paths. That was a bonus I liked. To be > honest I do not see the notification side effect as a big deal as those > are racy anyway and I would rather see fewer of them than more > (especially when it is clear that nothing is to be done). I don't know why we have this check there in the first place, it should be part of the OOM victim selection process and not a hacky shortcut in memcg. It's a blatant layering violation for no obvious reason. > > Again, how likely is it anyway that the kill was truly skipped and not > > just deferred? Reclaim failing is a good indicator that you're in > > trouble, a random task exiting in an ongoing workload does not say > > much. The machine could still be in trouble, so you just deferred the > > inevitable, you didn't really avoid a kill. > > > > At this point we are talking about OOM kill frequency and statistical > > probability during apparently normal operations. The OOM killer was > > never written for that, it was supposed to be a last minute resort > > that should not occur during normal operations and only if all SANE > > measures to avoid it have failed. 99% of all users have no interest > > in these micro-optimizations and we shouldn't clutter the code and > > have unpredictable behavior without even a trace of data to show that > > this is anything more than a placebo measure for one use case. > > OK, as it seems that the notification part is too controversial, how > would you like the following? It reverts the notification part and still > solves the fault on exit path. I will prepare the full patch with the > changelog if this looks reasonable: > --- > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 28c9221b74ea..f44fe7e65a98 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1783,6 +1783,16 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > unsigned int points = 0; > struct task_struct *chosen = NULL; > > + /* > + * If current has a pending SIGKILL or is exiting, then automatically > + * select it. The goal is to allow it to allocate so that it may > + * quickly exit and free its memory. > + */ > + if (fatal_signal_pending(current) || current->flags & PF_EXITING) { > + set_thread_flag(TIF_MEMDIE); > + goto cleanup; return; Anyway, I wish this would not be here at all and part of the thread scanner. As David said, this is a very cold path, why the shortcut? > @@ -2233,16 +2243,6 @@ bool mem_cgroup_oom_synchronize(bool handle) > if (!handle) > goto cleanup; > > - /* > - * If current has a pending SIGKILL or is exiting, then automatically > - * select it. The goal is to allow it to allocate so that it may > - * quickly exit and free its memory. > - */ > - if (fatal_signal_pending(current) || current->flags & PF_EXITING) { > - set_thread_flag(TIF_MEMDIE); > - goto cleanup; > - } > - > owait.memcg = memcg; > owait.wait.flags = 0; > owait.wait.func = memcg_oom_wake_function; > @@ -2266,6 +2266,13 @@ bool mem_cgroup_oom_synchronize(bool handle) > schedule(); > mem_cgroup_unmark_under_oom(memcg); > finish_wait(&memcg_oom_waitq, &owait.wait); > + > + /* Userspace OOM handler cannot set TIF_MEMDIE to a target */ > + if (memcg->oom_kill_disable) { > + if ((fatal_signal_pending(current) || > + current->flags & PF_EXITING)) > + set_thread_flag(TIF_MEMDIE); > + } This is an entirely different change that I think makes sense. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>