On Wed 07-03-18 15:52:15, David Rientjes wrote:
> Since the 2.6 kernel, the oom killer has slightly biased away from
> CAP_SYS_ADMIN processes by discounting some of its memory usage in
> comparison to other processes.
>
> This has always been implicit and nothing exactly relies on the behavior.
>
> Gaurav notices that __task_cred() can dereference a potentially freed
> pointer if the task under consideration is exiting because a reference to
> the task_struct is not held.
>
> Remove the CAP_SYS_ADMIN bias so that all processes are treated equally.
>
> If any CAP_SYS_ADMIN process would like to be biased against, it is always
> allowed to adjust /proc/pid/oom_score_adj.
>
> Reported-by: Gaurav Kohli <gkohli@xxxxxxxxxxxxxx>
> Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>

This is simpler than playing reference counting tricks and whatnot.
Moreover I do agree that this heuristic is questionable on its own. The
bias is basically random and invisible to the userspace. We already have
a way to tune the same thing by oom_score_adj.

Acked-by: Michal Hocko <mhocko@xxxxxxxx>

> ---
>  mm/oom_kill.c | 7 -------
>  1 file changed, 7 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -224,13 +224,6 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
>  	task_unlock(p);
>
> -	/*
> -	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> -	 * implementation used by LSMs.
> -	 */
> -	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> -		points -= (points * 3) / 100;
> -
>  	/* Normalize to oom_score_adj units */
>  	adj *= totalpages / 1000;
>  	points += adj;

-- 
Michal Hocko
SUSE Labs