[I am mostly offline for the rest of the week]

On Wed 16-01-19 14:41:31, Michal Hocko wrote:
> On Wed 16-01-19 22:32:50, Tetsuo Handa wrote:
> > On 2019/01/16 21:19, Michal Hocko wrote:
> > > On Wed 16-01-19 20:30:25, Tetsuo Handa wrote:
> > >> On 2019/01/16 20:09, Michal Hocko wrote:
> > >>> On Wed 16-01-19 19:55:21, Tetsuo Handa wrote:
> > >>>> This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
> > >>>> processes sharing mm have same view of oom_score_adj") and commit
> > >>>> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
> > >>>> close a race and reduce the latency at __set_oom_adj(), and reduces the
> > >>>> warning at __oom_kill_process() in order to minimize the latency.
> > >>>>
> > >>>> Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
> > >>>> to unmap the address space") introduced the worst case mentioned in
> > >>>> 44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
> > >>>> only administrators can trigger the worst case.
> > >>>>
> > >>>> Since 44a70adec910d692 did not take latency into account, we can hold RCU
> > >>>> for minutes and trigger RCU stall warnings by calling printk() on many
> > >>>> thousands of thread groups. Even without calling printk(), the latency is
> > >>>> mentioned by Yong-Taek Lee [1]. And I noticed that 44a70adec910d692 is
> > >>>> racy, and trying to fix the race will require a global lock which is too
> > >>>> costly for rare events.
> > >>>>
> > >>>> If the worst case in 44a70adec910d692 happens, it is an administrator's
> > >>>> request. Therefore, tolerate the worst case and speed up __set_oom_adj().
> > >>>
> > >>> I really do not think we care about latency. I consider the overall API
> > >>> sanity much more important. Besides that the original report you are
> > >>> referring to was never explained/shown to represent a real world usecase.
> > >>> oom_score_adj is not really an interface to be tweaked in hot paths.
> > >>
> > >> I do care about the latency. Holding RCU for more than 2 minutes is insane.
> > >
> > > Creating 8k threads could be considered insane as well. But more
> > > seriously. I absolutely do not insist on holding a single RCU section
> > > for the whole operation. But that doesn't really mean that we want to
> > > revert these changes. for_each_process is by far not only called from
> > > this path.
> >
> > Unlike check_hung_uninterruptible_tasks() where failing to resume after
> > breaking the RCU section is tolerable, failing to resume after breaking
> > the RCU section for __set_oom_adj() is not tolerable; it leaves the
> > possibility of different oom_score_adj values.
>
> Then make sure that no threads are really missed. Really I fail to see
> what you are actually arguing about. for_each_process is expensive. No
> question about that. If you can replace it for this specific and odd
> usecase then go ahead. But there is absolutely zero reason to have a
> broken oom_score_adj semantic just because somebody might have thousands
> of threads and want to update the score faster.

Btw. the current implementation annoyance is caused by the fact that the
oom_score_adj is per signal_struct rather than per mm_struct. The reason
is that we really need

	if (!vfork()) {
		set_oom_score_adj()
		exec()
	}

to work properly. One way around that is to special case oom_score_adj
for tasks in vfork and store their shadow value in the task_struct. The
shadow value would get transferred over to the mm struct once a new one
is allocated.
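To make the vfork special case a bit more concrete - this is a completely
untested sketch, and the in-vfork check below is only a placeholder for
whatever test we would actually end up using - the write side could look
roughly like:

	/* __set_oom_adj(), roughly */
	if (task->vfork_done) {
		/*
		 * In the middle of vfork - the mm is still shared with the
		 * parent, so only record the shadow value in the task itself.
		 */
		task->oom_score_adj = oom_adj;
	} else {
		task->signal->oom_score_adj = oom_adj;
	}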
So, on the read side, something very coarsely like

	short tsk_get_oom_score_adj(struct task_struct *tsk)
	{
		if (tsk->oom_score_adj != OOM_SCORE_ADJ_INVALID)
			return tsk->oom_score_adj;
		return tsk->signal->oom_score_adj;
	}

and use this helper instead of direct oom_score_adj usage. Then the
setting needs to be special cased in __set_oom_adj (along the lines of
the sketch above) and dup_mm needs to copy the value over instead of
copy_signal. I think this is doable.
-- 
Michal Hocko
SUSE Labs