Oleg Nesterov wrote:
> On 09/22, Tetsuo Handa wrote:
> >
> > I imagined a dedicated kernel thread doing something like shown below.
> > (I don't know about mm->mmap management.)
> > mm->mmap_zapped corresponds to MMF_MEMDIE.
>
> No, it doesn't, please see below.
>
> > bool has_sigkill_task;
> > wait_queue_head_t kick_mm_zapper;
>
> OK, if this kthread is kicked by oom this makes more sense, but still
> doesn't look right at least initially.

Yes, I meant this kthread is kicked upon sending SIGKILL. But I forgot that

>
> Let me repeat, I do think we need MMF_MEMDIE or something like it before
> we do something more clever. And in fact I think this flag makes sense
> regardless.
>
> > static void mm_zapper(void *unused)
> > {
> >     struct task_struct *g, *p;
> >     struct mm_struct *mm;
> >
> > sleep:
> >     wait_event(kick_mm_zapper, has_sigkill_task);
> >     has_sigkill_task = false;
> > restart:
> >     rcu_read_lock();
> >     for_each_process_thread(g, p) {
> >         if (likely(!fatal_signal_pending(p)))
> >             continue;
> >         task_lock(p);
> >         mm = p->mm;
> >         if (mm && mm->mmap && !mm->mmap_zapped && down_read_trylock(&mm->mmap_sem)) {
>                                  ^^^^^^^^^^^^^^^
>
> We do not want mm->mmap_zapped, it can't work. We need mm->needs_zap
> set by oom_kill_process() and cleared after zap_page_range().
>
> Because otherwise we can not handle CLONE_VM correctly. Suppose that
> an innocent process P does vfork() and the child is killed but not
> exited yet. mm_zapper() can find the child, do zap_page_range(), and
> surprise its alive parent P which uses the same ->mm.

kill(P's-child, SIGKILL) does not kill P, which shares the same ->mm.
Thus, mm_zapper() can be used only for the OOM-kill case, and
test_tsk_thread_flag(p, TIF_MEMDIE) should be used rather than
fatal_signal_pending(p).

>
> And if we rely on MMF_MEMDIE or mm->needs_zap or whatever then
> for_each_process_thread() doesn't really make sense. And if we have
> a single MMF_MEMDIE process (likely case) then the unconditional
> _trylock is suboptimal.

I guess the more likely case is that the OOM victim successfully exits
before mm_zapper() finds it. I thought that a dedicated kernel thread
which scans the task list could do deferred zapping by automatically
retrying (at an interval of a few seconds?) when down_read_trylock()
fails.

>
> Tetsuo, can't we do something simple which "obviously can't hurt at
> least" and then discuss the potential improvements?

No problem. I can wait for your version.

>
> And yes, yes, the "Kill all user processes sharing victim->mm" logic
> in oom_kill_process() doesn't 100% look right, at least wrt the change
> we discuss.

If we use test_tsk_thread_flag(p, TIF_MEMDIE), we will need to set
TIF_MEMDIE on the victim after sending SIGKILL to all processes sharing
the victim's mm. Well, does the likely case, where the OOM victim exits
before mm_zapper() finds it, then become a not-so-likely case? If so,
MMF_MEMDIE is better than test_tsk_thread_flag(p, TIF_MEMDIE)...

>
> Oleg.
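
To make the CLONE_VM concern discussed above concrete, here is a minimal
userspace sketch, not kernel code. It only models the bookkeeping: the
toy_mm and toy_task structures, the zap_count counter and all function
names are hypothetical stand-ins invented for illustration, and "zapping"
is just an increment. It contrasts a per-task check (in the spirit of
fatal_signal_pending(p)) with a per-mm flag that only a simulated OOM
killer sets (in the spirit of mm->needs_zap or MMF_MEMDIE).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct. */
struct toy_mm {
	bool needs_zap;	/* set by the simulated OOM killer, cleared after zapping */
	int zap_count;	/* how many times this address space was "zapped" */
};

/* Hypothetical stand-in for task_struct. */
struct toy_task {
	bool sigkill_pending;	/* models fatal_signal_pending(p) */
	struct toy_mm *mm;	/* shared between tasks created with CLONE_VM */
};

/* Per-task check: zap the mm of every task that has SIGKILL pending. */
static int zap_by_signal(struct toy_task *tasks, size_t n)
{
	int zapped = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_task *p = &tasks[i];

		if (!p->sigkill_pending || !p->mm)
			continue;
		p->mm->zap_count++;
		zapped++;
	}
	return zapped;
}

/* Per-mm check: zap only address spaces marked by the simulated OOM killer. */
static int zap_by_mm_flag(struct toy_task *tasks, size_t n)
{
	int zapped = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_mm *mm = tasks[i].mm;

		if (!mm || !mm->needs_zap)
			continue;
		mm->needs_zap = false;	/* cleared after the zap, so it runs once per mm */
		mm->zap_count++;
		zapped++;
	}
	return zapped;
}

/* Simulated OOM kill: SIGKILL every task sharing the victim's mm, then mark the mm. */
static void toy_oom_kill(struct toy_task *tasks, size_t n, struct toy_task *victim)
{
	assert(victim->mm != NULL);
	for (size_t i = 0; i < n; i++)
		if (tasks[i].mm == victim->mm)
			tasks[i].sigkill_pending = true;
	victim->mm->needs_zap = true;
}

/* Count tasks that still use mm and have not been killed. */
static int live_users(const struct toy_task *tasks, size_t n, const struct toy_mm *mm)
{
	int live = 0;

	for (size_t i = 0; i < n; i++)
		if (tasks[i].mm == mm && !tasks[i].sigkill_pending)
			live++;
	return live;
}
```

With a parent P and a killed vfork() child sharing one toy_mm,
zap_by_signal() zaps the mm while live_users() still reports P as alive,
which is the "surprise" described above. zap_by_mm_flag() does nothing
until toy_oom_kill() has marked the mm, and by then every task sharing
it has SIGKILL pending.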