On 2019/01/11 20:33, Michal Hocko wrote:
> On Fri 11-01-19 19:25:22, Tetsuo Handa wrote:
>> On 2019/01/11 8:59, Tetsuo Handa wrote:
>>> Michal Hocko wrote:
>>>> On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
>>>>> On 2019/01/09 20:03, Michal Hocko wrote:
>>>>>> Tetsuo,
>>>>>> can you confirm that these two patches are fixing the issue you have
>>>>>> reported please?
>>>>>>
>>>>>
>>>>> My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
>>>>> report racy no-eligible OOM tasks" does.
>>>>
>>>> OK, so we are stuck again. Hooray!
>>>
>>> Andrew, will you pick up "[PATCH 3/2] memcg: Facilitate termination of memcg OOM victims." ?
>>> Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
>>> when task_will_free_mem() == true, memcg-do-not-report-racy-no-eligible-oom-tasks.patch
>>> does not close the race whereas my patch closes the race better.
>>>
>>
>> I confirmed that mm-oom-marks-all-killed-tasks-as-oom-victims.patch and
>> memcg-do-not-report-racy-no-eligible-oom-tasks.patch are completely failing
>> to fix the issue I am reporting. :-(
>
> OK, this is really interesting. This means that we are racing
> when marking all the tasks sharing the mm with the clone syscall.

Nothing interesting. This is NOT a race between clone() and the OOM killer. :-(
By the time the OOM killer is invoked, all clone() requests have already completed.

Did you notice that there is no

  "Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n"

line between

  [ 71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child

and

  [ 71.309149][ T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB

?

Then you will find that, in the first round of the out_of_memory() call, [ T9694]
never reached the for_each_process(p) loop inside __oom_kill_process(), because
find_lock_task_mm() returned NULL in __oom_kill_process(): Ctrl-C had let that
victim complete exit_mm() before find_lock_task_mm() was called.

Then, in the second round of the out_of_memory() call, [ T9750] (which was
fatal_signal_pending() == T && tsk_is_oom_victim() == F) hit the
task_will_free_mem(current) path, called mark_oom_victim() and woke up the
OOM reaper. Then, before the third round of the out_of_memory() call started,
the OOM reaper set MMF_OOM_SKIP.

When the third round of the out_of_memory() call started, [ T9748] could not
hit the task_will_free_mem(current) path because MMF_OOM_SKIP was already set,
and oom_badness() ignored any mm which already has MMF_OOM_SKIP set. As a
result, [ T9748] failed to find a candidate. And this step repeats up to the
number of threads (213 times for this run).

> Does fatal_signal_pending handle this better?

Of course. My patch handles it perfectly.

Even if we raced with clone() requests, why would we need to care about threads
doing clone() requests? Such threads are not inside try_charge(), and therefore
they cannot contribute to this issue by calling out_of_memory() from try_charge().
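
To make the sequence above easier to follow, here is a rough userspace model of
the three rounds (an illustration only, not kernel code; the mm_state struct and
the round*() helpers are made-up names, and only the flags that matter for the
sequence are tracked):

#include <stdbool.h>
#include <stdio.h>

struct mm_state {
	bool mm_released;	/* victim already finished exit_mm() */
	bool oom_victim;	/* mark_oom_victim() has been called */
	bool mmf_oom_skip;	/* OOM reaper has set MMF_OOM_SKIP */
};

/* Round 1: Ctrl-C let the victim finish exit_mm() first, so
 * find_lock_task_mm() fails and __oom_kill_process() returns before
 * printing "Killed process ..." or walking for_each_process(). */
static void round1(struct mm_state *mm)
{
	if (mm->mm_released)
		printf("round 1: find_lock_task_mm() == NULL, bail out\n");
}

/* Round 2: a thread with a fatal signal pending but not yet an OOM
 * victim hits the task_will_free_mem(current) path, marks itself an
 * OOM victim and wakes the OOM reaper. */
static void round2(struct mm_state *mm)
{
	mm->oom_victim = true;
	printf("round 2: task_will_free_mem(current) -> mark_oom_victim()\n");
	/* The OOM reaper runs before round 3 and gives up on the mm. */
	mm->mmf_oom_skip = true;
}

/* Rounds 3..N: MMF_OOM_SKIP defeats both the task_will_free_mem()
 * shortcut and oom_badness(), so no candidate is ever found. */
static void round_n(struct mm_state *mm, int round)
{
	if (mm->mmf_oom_skip)
		printf("round %d: no eligible candidate found\n", round);
}

int main(void)
{
	struct mm_state mm = { .mm_released = true };
	int i;

	round1(&mm);
	round2(&mm);
	for (i = 3; i <= 5; i++)	/* really up to the thread count */
		round_n(&mm, i);
	return 0;
}

The point is that once MMF_OOM_SKIP is set, every later round falls through
without finding a candidate, once per remaining thread, which is why the same
step repeated 213 times in this run.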