On Mon, 8 Jun 2015, Michal Hocko wrote: > On Mon 08-06-15 12:51:53, David Rientjes wrote: > > On Fri, 5 Jun 2015, Michal Hocko wrote: > > > > > > Nack, this is not the appropriate response to exit path livelocks. By > > > > doing this, you are going to start unnecessarily panicking machines that > > > > have panic_on_oom set when it would not have triggered before. If there > > > > is no reclaimable memory and a process that has already been signaled to > > > > die to is in the process of exiting has to allocate memory, it is > > > > perfectly acceptable to give them access to memory reserves so they can > > > > allocate and exit. Under normal circumstances, that allows the process to > > > > naturally exit. With your patch, it will cause the machine to panic. > > > > > > Isn't that what the administrator of the system wants? The system > > > is _clearly_ out of memory at this point. A coincidental exiting task > > > doesn't change a lot in that regard. Moreover it increases a risk of > > > unnecessarily unresponsive system which is what panic_on_oom tries to > > > prevent from. So from my POV this is a clear violation of the user > > > policy. > > > > > > > We rely on the functionality that this patch is short cutting because we > > rely on userspace to trigger oom kills. For system oom conditions, we > > must then rely on the kernel oom killer to set TIF_MEMDIE since userspace > > cannot grant it itself. (I think the memcg case is very similar in that > > this patch is short cutting it, but I'm more concerned for the system oom > > in this case because it's a show stopper for us.) > > Do you actually have panic_on_oops enabled? > CONFIG_PANIC_ON_OOPS_VALUE should be 0, I'm not sure why that's relevant. The functionality I'm referring to is that your patch now panics the machine for configs where /proc/sys/vm/panic_on_oom is set and the same scenario occurs as described above. You're introducing userspace breakage because you are using panic_on_oom in a way that it hasn't been used in the past and isn't described as working in the documentation. > > We want to send the SIGKILL, which will interrupt things like > > But this patch only changes the ordering of panic_on_oops vs. > fatal_signal_pending(current) shortcut. So it can possible affect the > case when the current is exiting during OOM. Is this the case that you > are worried about? > Yes, of course, the case specifically when the killed process is in the exit path due to a userspace oom kill and needs access to memory reserves to exit. That's needed because the machine is oom (typically the only time a non-buggy userspace oom handler would kill a process). This: check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL); ... if (current->mm && (fatal_signal_pending(current) || task_will_free_mem(current))) { mark_oom_victim(current); goto out; } is obviously buggy in that regard. We need to be able to give a killed process or an exiting process memory reserves so it may (1) allocate prior to handling the signal and (2) be assured of exiting once it is in the exit path. > The documentation clearly states: > " > If this is set to 1, the kernel panics when out-of-memory happens. > However, if a process limits using nodes by mempolicy/cpusets, > and those nodes become memory exhaustion status, one process > may be killed by oom-killer. No panic occurs in this case. > Because other nodes' memory may be free. This means system total status > may be not fatal yet. > > If this is set to 2, the kernel panics compulsorily even on the > above-mentioned. Even oom happens under memory cgroup, the whole > system panics. > > The default value is 0. > 1 and 2 are for failover of clustering. Please select either > according to your policy of failover. > " > > So I read this as a reliability feature to handle oom situation as soon > as possible. > A userspace process that is killed by userspace that simply needs memory to handle the signal and exit is not oom. We have always allowed current to have access to memory reserves to exit without triggering panic_on_oom. This is nothing new, and is not implied by the documentation to be the case. I'm not going to spend all day trying to convince you that you cannot change the semantics of sysctls that have existed for years with new behavior especially when users require that behavior to handle userspace kills while still keeping their machines running. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>