2011/5/20 KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>: > CAI Qian reported oom-killer killed all system daemons in his > system at first if he ran fork bomb as root. The problem is, > current logic give them bonus of 3% of system ram. Example, > he has 16GB machine, then root processes have ~500MB oom > immune. It bring us crazy bad result. _all_ processes have > oom-score=1 and then, oom killer ignore process memory usage > and kill random process. This regression is caused by commit > a63d83f427 (oom: badness heuristic rewrite). > > This patch changes select_bad_process() slightly. If oom points == 1, > it's a sign that the system have only root privileged processes or > similar. Thus, select_bad_process() calculate oom badness without > root bonus and select eligible process. > > Also, this patch move finding sacrifice child logic into > select_bad_process(). It's necessary to implement adequate > no root bonus recalculation. and it makes good side effect, > current logic doesn't behave as the doc. > > Documentation/sysctl/vm.txt says > >  Âoom_kill_allocating_task > >  ÂIf this is set to non-zero, the OOM killer simply kills the task that >  Âtriggered the out-of-memory condition. ÂThis avoids the expensive >  Âtasklist scan. > > IOW, oom_kill_allocating_task shouldn't search sacrifice child. > This patch also fixes this issue. > > Reported-by: CAI Qian <caiqian@xxxxxxxxxx> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx> > --- > Âfs/proc/base.c   Â|  Â2 +- > Âinclude/linux/oom.h |  Â3 +- > Âmm/oom_kill.c    |  89 ++++++++++++++++++++++++++++---------------------- > Â3 files changed, 53 insertions(+), 41 deletions(-) > > diff --git a/fs/proc/base.c b/fs/proc/base.c > index d6b0424..b608b69 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -482,7 +482,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer) > >    Âread_lock(&tasklist_lock); >    Âif (pid_alive(task)) { > -        points = oom_badness(task, NULL, NULL, totalpages); > +        points = oom_badness(task, NULL, NULL, totalpages, 1); >        Âratio = points * 1000 / totalpages; >    Â} >    Âread_unlock(&tasklist_lock); > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 0f5b588..3dd3669 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -42,7 +42,8 @@ enum oom_constraint { > > Â/* The badness from the OOM killer */ > Âextern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem, > -            const nodemask_t *nodemask, unsigned long totalpages); > +            const nodemask_t *nodemask, unsigned long totalpages, > +            int protect_root); > Âextern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags); > Âextern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags); > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 8bbc3df..7d280d4 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -133,7 +133,8 @@ static bool oom_unkillable_task(struct task_struct *p, > Â* task consuming the most memory to avoid subsequent oom failures. > Â*/ > Âunsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem, > -           const nodemask_t *nodemask, unsigned long totalpages) > +            Âconst nodemask_t *nodemask, unsigned long totalpages, > +            Âint protect_root) > Â{ >    Âunsigned long points; >    Âunsigned long score_adj = 0; > @@ -186,7 +187,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem, >     * >     * XXX: Too large bonus, example, if the system have tera-bytes memory.. >     */ > -    if (has_capability_noaudit(p, CAP_SYS_ADMIN)) { > +    if (protect_root && has_capability_noaudit(p, CAP_SYS_ADMIN)) { >        Âif (points >= totalpages / 32) >            Âpoints -= totalpages / 32; >        Âelse > @@ -298,8 +299,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > Â{ >    Âstruct task_struct *g, *p; >    Âstruct task_struct *chosen = NULL; > -    *ppoints = 0; > +    int protect_root = 1; > +    unsigned long chosen_points = 0; > +    struct task_struct *child; > > + retry: >    Âdo_each_thread_reverse(g, p) { >        Âunsigned long points; > > @@ -332,7 +336,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, >             */ >            Âif (p == current) { >                Âchosen = p; > -                *ppoints = ULONG_MAX; > +                chosen_points = ULONG_MAX; >            Â} else { >                Â/* >                 * If this task is not being ptraced on exit, > @@ -345,13 +349,49 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, >            Â} >        Â} > > -        points = oom_badness(p, mem, nodemask, totalpages); > -        if (points > *ppoints) { > +        points = oom_badness(p, mem, nodemask, totalpages, protect_root); > +        if (points > chosen_points) { >            Âchosen = p; > -            *ppoints = points; > +            chosen_points = points; >        Â} >    Â} while_each_thread(g, p); > > +    /* > +    Â* chosen_point==1 may be a sign that root privilege bonus is too large > +    Â* and we choose wrong task. Let's recalculate oom score without the > +    Â* dubious bonus. > +    Â*/ > +    if (protect_root && (chosen_points == 1)) { > +        protect_root = 0; > +        goto retry; > +    } The idea is good to me. But once we meet it, should we give up protecting root privileged processes? How about decaying bonus point? -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href