On Tue 02-12-14 17:08:04, Tejun Heo wrote: > Hello, sorry about the delay. Was on vacation. > > Generally looks good to me. Some comments below. > > > @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = { > > > > static void moom_callback(struct work_struct *ignored) > > { > > - out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL, > > - 0, NULL, true); > > + if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), > > + GFP_KERNEL, 0, NULL, true)) { > > + printk(KERN_INFO "OOM request ignored because killer is disabled\n"); > > + } > > } > > CodingStyle line 157 says "Do not unnecessarily use braces where a > single statement will do.". Sure. Fixed > > +/** > > + * oom_killer_disable - disable OOM killer > > + * > > + * Forces all page allocations to fail rather than trigger OOM killer. > > + * Will block and wait until all OOM victims are dead. > > + * > > + * Returns true if successfull and false if the OOM killer cannot be > > + * disabled. > > + */ > > +extern bool oom_killer_disable(void); > > And function comments usually go where the function body is, not where > the function is declared, no? Fixed > > @@ -157,27 +132,11 @@ int freeze_processes(void) > > pm_wakeup_clear(); > > printk("Freezing user space processes ... "); > > pm_freezing = true; > > - oom_kills_saved = oom_kills_count(); > > error = try_to_freeze_tasks(true); > > if (!error) { > > __usermodehelper_set_disable_depth(UMH_DISABLED); > > - oom_killer_disable(); > > - > > - /* > > - * There might have been an OOM kill while we were > > - * freezing tasks and the killed task might be still > > - * on the way out so we have to double check for race. > > - */ > > - if (oom_kills_count() != oom_kills_saved && > > - !check_frozen_processes()) { > > - __usermodehelper_set_disable_depth(UMH_ENABLED); > > - printk("OOM in progress."); > > - error = -EBUSY; > > - } else { > > - printk("done."); > > - } > > + printk("done.\n"); > > A delta but shouldn't it be pr_cont()? kernel/power/process.c doesn't use pr_* so I've stayed with what the rest of the file is using. I can add a patch which transforms all of them. > ... > > @@ -206,6 +165,18 @@ int freeze_kernel_threads(void) > > printk("\n"); > > BUG_ON(in_atomic()); > > > > + /* > > + * Now that everything freezable is handled we need to disbale > > + * the OOM killer to disallow any further interference with > > + * killable tasks. > > + */ > > + printk("Disabling OOM killer ... "); > > + if (!oom_killer_disable()) { > > + printk("failed.\n"); > > + error = -EAGAIN; > > + } else > > + printk("done.\n"); > > Ditto on pr_cont() and > > CodingStyle line 169 says "This does not apply > if only one branch of a conditional statement is a single statement; > in the latter case use braces in both branches:" Fixed > > @@ -251,6 +220,9 @@ void thaw_kernel_threads(void) > > { > > struct task_struct *g, *p; > > > > + printk("Enabling OOM killer again.\n"); > > Do we really need this printk? The same goes for Disabling OOM > killer. For freezing it makes some sense because freezing may take a > considerable amount of time and even occassionally fail due to > timeout. We aren't really expecting those to happen for OOM victims. I just considered them useful if there are follow up allocation failure messages to know that they are due to OOM killer. I can remove them. > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 302e0fc6d121..34bcbb053132 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > > current->memcg_oom.order = order; > > } > > > > +extern bool oom_killer_disabled; > > Ugh... don't we wanna put this in a header file? Who else would need the declaration? This is not something random code should look at. > > +void mark_tsk_oom_victim(struct task_struct *tsk) > > { > > - return atomic_read(&oom_kills); > > + BUG_ON(oom_killer_disabled); > > WARN_ON_ONCE() is prolly a better option here? Well, something fishy is going on when oom_killer_disabled is set and we mark new OOM victim. This is a clear bug. Why would be warning and a allow the follow up breakage? > > + if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) > > Can a task actually be selected as an OOM victim multiple times? AFAICS nothing prevents from global OOM and memcg OOM killers racing. > > + return; > > + atomic_inc(&oom_victims); > > } > > > > -void note_oom_kill(void) > > +void unmark_tsk_oom_victim(struct task_struct *tsk) > > { > > - atomic_inc(&oom_kills); > > + int count; > > + > > + if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) > > + return; > > Maybe test this inline in exit_mm()? e.g. > > if (test_thread_flag(TIF_MEMDIE)) > unmark_tsk_oom_victim(current); Why do you think testing TIF_MEMDIE in exit_mm is better? I would like to reduce the usage of the flag as much as possible. > Also, can the function ever be called by someone other than current? > If not, why would it take @task? Changed to use current only. If there is anybody who needs that we can change that later. I wanted to have it symmetric to mark_tsk_oom_victim but that is not that important. > > + > > + down_read(&oom_sem); > > + /* > > + * There is no need to signal the lasst oom_victim if there > > + * is nobody who cares. > > + */ > > + if (!atomic_dec_return(&oom_victims) && oom_killer_disabled) > > + complete(&oom_victims_wait); > > I don't think using completion this way is safe. Please read on. > > > + up_read(&oom_sem); > > } > > > > -void mark_tsk_oom_victim(struct task_struct *tsk) > > +bool oom_killer_disable(void) > > { > > - set_tsk_thread_flag(tsk, TIF_MEMDIE); > > + /* > > + * Make sure to not race with an ongoing OOM killer > > + * and that the current is not the victim. > > + */ > > + down_write(&oom_sem); > > + if (!test_tsk_thread_flag(current, TIF_MEMDIE)) > > + oom_killer_disabled = true; > > Prolly "if (TIF_MEMDIE) { unlock; return; }" is easier to follow. OK > > + > > + count = atomic_read(&oom_victims); > > + up_write(&oom_sem); > > + > > + if (count && oom_killer_disabled) > > + wait_for_completion(&oom_victims_wait); > > So, each complete() increments the done count and wait decs. The > above code works iff the complete()'s and wait()'s are always balanced > which usually isn't true in this type of wait code. Either use > reinit_completion() / complete_all() combos or wait_event(). Hmm, I thought that only a single instance of freeze_kernel_threads (which calls oom_killer_disable) can run at a time. But I am currently not sure that all paths are called under lock_system_sleep. I am not familiar with reinit_completion API. Is the following correct? [...] @@ -434,10 +434,23 @@ void unmark_tsk_oom_victim(struct task_struct *tsk) * is nobody who cares. */ if (!atomic_dec_return(&oom_victims) && oom_killer_disabled) - complete(&oom_victims_wait); + complete_all(&oom_victims_wait); up_read(&oom_sem); } [...] @@ -445,16 +458,23 @@ bool oom_killer_disable(void) * and that the current is not the victim. */ down_write(&oom_sem); - if (!test_tsk_thread_flag(current, TIF_MEMDIE)) - oom_killer_disabled = true; + if (test_thread_flag(TIF_MEMDIE)) { + up_write(&oom_sem); + return false; + } + + /* unmark_tsk_oom_victim is calling complete_all */ + if (!oom_killer_disable) + reinit_completion(&oom_victims_wait); + oom_killer_disabled = true; count = atomic_read(&oom_victims); up_write(&oom_sem); - if (count && oom_killer_disabled) + if (count) wait_for_completion(&oom_victims_wait); - return oom_killer_disabled; + return true; } > > + > > + return oom_killer_disabled; > > Maybe 0 / -errno is better choice as return values? I do not have problem to change this if you feel strong about it but true/false sounds easier to me and it allows the caller to decide what to report. If there were multiple reasons to fail then sure but that is not the case. > > +/** out_of_memory - tries to invoke OOM killer. > > Formatting? fixed > > + * @zonelist: zonelist pointer > > + * @gfp_mask: memory allocation flags > > + * @order: amount of memory being requested as a power of 2 > > + * @nodemask: nodemask passed to page allocator > > + * @force_kill: true if a task must be killed, even if others are exiting > > + * > > + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable() > > + * when it returns false. Otherwise returns true. > > + */ > > +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > > + int order, nodemask_t *nodemask, bool force_kill) > > +{ > > + bool ret = false; > > + > > + down_read(&oom_sem); > > + if (!oom_killer_disabled) { > > + __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); > > + ret = true; > > + } > > + up_read(&oom_sem); > > + > > + return ret; > > Ditto on return value. 0 / -EBUSY seem like a better choice to me. > > > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void) > > { > > struct zonelist *zonelist; > > > > + down_read(&oom_sem); > > if (mem_cgroup_oom_synchronize(true)) > > - return; > > + goto unlock; > > > > zonelist = node_zonelist(first_memory_node, GFP_KERNEL); > > if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { > > - out_of_memory(NULL, 0, 0, NULL, false); > > + if (!oom_killer_disabled) > > + __out_of_memory(NULL, 0, 0, NULL, false); > > oom_zonelist_unlock(zonelist, GFP_KERNEL); > > Is this a condition which can happen and we can deal with? With > userland fully frozen, there shouldn't be page faults which lead to > memory allocation, right? Except for racing OOM victims which were missed by try_to_freeze_tasks because they didn't get cpu slice to wake up from the freezer. The task would die on the way out from the page fault exception. I have updated the changelog to be more verbose about this. > Shouldn't we document how oom disable/enable is supposed to be used Well the API shouldn't be used outside of the PM freezer IMO. This is not a general API that other part of the kernel should be using. I can surely add more documentation for the PM usage though. I have rewritten the changelog: " As oom_killer_disable() is a full OOM barrier now we can postpone it in the PM freezer to later after all freezable user tasks are considered frozen (to freeze_kernel_threads). Normally there wouldn't be any unfrozen user tasks at this moment so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further\ allocation would fail - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet " And I've added: +/** + * oom_killer_disable - disable OOM killer + * + * Forces all page allocations to fail rather than trigger OOM killer. + * Will block and wait until all OOM victims are dead. + * + * The function cannot be called when there are runnable user tasks because + * the userspace would see unexpected allocation failures as a result. Any + * new usage of this function should be consulted with MM people. + * + * Returns true if successful and false if the OOM killer cannot be + * disabled. + */ bool oom_killer_disable(void) > (it only makes sense while the whole system is in quiescent state) > and at least trigger WARN_ON_ONCE() if the above code path gets > triggered while oom killer is disabled? I can add a WARN_ON(!test_thread_flag(tsk, TIF_MEMDIE)). Thanks for the review! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>