On 2018/10/09 16:50, Michal Hocko wrote:
> On Tue 09-10-18 08:35:41, Michal Hocko wrote:
>> [I have only now noticed that the patch has been reposted]
>>
>> On Mon 08-10-18 18:27:39, Tetsuo Handa wrote:
>>> On 2018/10/08 17:38, Yong-Taek Lee wrote:
> [...]
>>>> Thank you for your suggestion. But I think it would be better to separate to 2 issues. How about think these
>>>> issues separately because there are no dependency between race issue and my patch. As I already explained,
>>>> for_each_process path is meaningless if there is only one thread group with many threads(mm_users > 1 but
>>>> no other thread group sharing same mm). Do you have any other idea to avoid meaningless loop ?
>>>
>>> Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure processes
>>> sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded ("mm, oom:
>>> kill all tasks sharing the mm").
>>
>> This would require a lot of other work for something as border line as
>> weird threading model like this. I will think about something more
>> appropriate - e.g. we can take mmap_sem for read while doing this check
>> and that should prevent from races with [v]fork.
>
> Not really. We do not even take the mmap_sem when CLONE_VM. So this is
> not the way. Doing a proper synchronization seems much harder. So let's
> consider what is the worst case scenario. We would basically hit a race
> window between copy_signal and copy_mm and the only relevant case would
> be OOM_SCORE_ADJ_MIN which wouldn't propagate to the new "thread".

The "between copy_signal() and copy_mm()" window only determines whether we
need to run the for_each_process() loop. The actual race window is much larger
than that; it is between "copy_signal() copies oom_score_adj/oom_score_adj_min"
and "the created thread becomes accessible from the for_each_process() loop".

> OOM killer could then pick up the "thread" and kill it along with the whole
> process group sharing the mm.

Just reverting commit 44a70adec910d692 and commit 97fd49c2355ffded is sufficient.

> Well, that is unfortunate indeed and it
> breaks the OOM_SCORE_ADJ_MIN contract. There are basically two ways here
> 1) do not care and encourage users to use a saner way to set
> OOM_SCORE_ADJ_MIN because doing that externally is racy anyway e.g.
> setting it before [v]fork & exec. Btw. do we know about an actual user
> who would care?

I'm not talking about [v]fork & exec. Why are you talking about [v]fork & exec?

> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> reap the mm in the rare case of the race.

That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
to -1000 (and allowed unprivileged users to OOM-lockup the system). Now that we
set MMF_OOM_SKIP, there is no need to worry about an "oom_score_adj != -1000"
thread group and an "oom_score_adj == -1000" thread group sharing the same mm.
Since updating oom_score_adj to -1000 is a privileged operation, such a case can
only happen because the administrator wished it; the kernel should respect the
administrator's wish.

>
> I would prefer the first but if this race really has to be addressed then
> the 2 sounds more reasonable than the wholesale revert.
>
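
For readers following the thread, option 1) quoted above (set the value before [v]fork & exec instead of writing it externally) could look roughly like the user-space sketch below. This is only an illustration under stated assumptions, not code from any patch in this thread: the helper names write_oom_score_adj(), read_oom_score_adj() and spawn_with_oom_score_adj() are hypothetical, and the path is a parameter only so that the helpers can be exercised against an ordinary file.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write an oom_score_adj value to a procfs-style file.
 * Returns 0 on success, -1 on failure. On a real /proc file, lowering the
 * value below the current minimum (e.g. to -1000) needs CAP_SYS_RESOURCE,
 * and the error may only show up when the stream is flushed by fclose().
 */
static int write_oom_score_adj(const char *path, int value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", value) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/* Hypothetical helper: read the value back. Returns 0 on success. */
static int read_oom_score_adj(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Hypothetical helper: set the value on the caller first, then fork and
 * exec, so that the child starts with the value already in place and no
 * external writer has to race with its creation. Note that this also
 * changes the caller's own oom_score_adj.
 * Returns the child's pid in the parent, or -1 on failure.
 */
static pid_t spawn_with_oom_score_adj(int value, char *const argv[])
{
	pid_t pid;

	if (write_oom_score_adj("/proc/self/oom_score_adj", value) != 0)
		return -1;

	pid = fork();
	if (pid == 0) {
		execvp(argv[0], argv);
		_exit(127);	/* exec failed */
	}
	return pid;
}
```

The sketch only covers the "set it yourself before creating the child" pattern; it says nothing about the larger race window between copy_signal() and visibility in for_each_process() discussed above.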