On Tue, 8 Jun 2010, Andrew Morton wrote:

> > The oom killer presently kills current whenever there is no more memory
> > free or reclaimable on its mempolicy's nodes.  There is no guarantee that
> > current is a memory-hogging task or that killing it will free any
> > substantial amount of memory, however.
> >
> > In such situations, it is better to scan the tasklist for nodes that are
> > allowed to allocate on current's set of nodes and kill the task with the
> > highest badness() score.  This ensures that the most memory-hogging task,
> > or the one configured by the user with /proc/pid/oom_adj, is always
> > selected in such scenarios.
> >
> > ...
> >
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/module.h>
> >  #include <linux/notifier.h>
> >  #include <linux/memcontrol.h>
> > +#include <linux/mempolicy.h>
> >  #include <linux/security.h>
> >
> >  int sysctl_panic_on_oom;
> > @@ -36,20 +37,36 @@ static DEFINE_SPINLOCK(zone_scan_lock);
> >  /* #define DEBUG */
> >
> >  /*
> > - * Is all threads of the target process nodes overlap ours?
> > + * Do all threads of the target process overlap our allowed nodes?
> > + * @tsk: task struct of which task to consider
> > + * @mask: nodemask passed to page allocator for mempolicy ooms

> The comment uses kerneldoc annotation but isn't a kerneldoc comment.

I'll fix it.

> >  */
> > -static int has_intersects_mems_allowed(struct task_struct *tsk)
> > +static bool has_intersects_mems_allowed(struct task_struct *tsk,
> > +                                        const nodemask_t *mask)
> >  {
> > -        struct task_struct *t;
> > +        struct task_struct *start = tsk;
> >
> > -        t = tsk;
> >          do {
> > -                if (cpuset_mems_allowed_intersects(current, t))
> > -                        return 1;
> > -                t = next_thread(t);
> > -        } while (t != tsk);
> > -
> > -        return 0;
> > +                if (mask) {
> > +                        /*
> > +                         * If this is a mempolicy constrained oom, tsk's
> > +                         * cpuset is irrelevant.  Only return true if its
> > +                         * mempolicy intersects current, otherwise it may be
> > +                         * needlessly killed.
> > +                         */
> > +                        if (mempolicy_nodemask_intersects(tsk, mask))
> > +                                return true;

> The comment refers to `current' but the code does not?

mempolicy_nodemask_intersects() compares tsk's mempolicy to current's, so
we don't need to pass current into the function (and we optimize for that:
task_lock(current) isn't needed since nothing else can change its
mempolicy).

> > +                } else {
> > +                        /*
> > +                         * This is not a mempolicy constrained oom, so only
> > +                         * check the mems of tsk's cpuset.
> > +                         */

> The comment doesn't refer to `current', but the code does.  Confused.

This simply compares the cpuset mems_allowed of both tasks passed to
cpuset_mems_allowed_intersects().

> > +                        if (cpuset_mems_allowed_intersects(current, tsk))
> > +                                return true;
> > +                }
> > +                tsk = next_thread(tsk);

> hm, next_thread() uses list_entry_rcu().  What are the locking rules
> here?  It's one of both of rcu_read_lock() and read_lock(&tasklist_lock),
> I think?

Oleg addressed this in his response.

> > +        } while (tsk != start);
> > +        return false;
> >  }

> This is all bloat and overhead for non-NUMA builds.  I doubt if gcc is
> able to eliminate the task_struct walk (although I didn't check).
>
> The function isn't oom-killer-specific at all - give it a better name
> then move it to mempolicy.c or similar?  If so, the text "oom"
> shouldn't appear in the comments.

It's the only place where we want to filter tasks based on whether they
share mempolicy nodes or cpuset mems, though, so I think it's
appropriately placed in mm/oom_kill.c.
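For reference, the only intended caller is the oom killer's tasklist scan,
which looks roughly like the sketch below (simplified, not the actual
select_bad_process() code; "nodemask" stands for the mask handed down from
the page allocator).  Holding tasklist_lock across the scan is also what
makes the next_thread() walk above safe:

        struct task_struct *p;

        read_lock(&tasklist_lock);      /* pins the task/thread lists for next_thread() */
        for_each_process(p) {
                /* skip tasks that cannot free memory on our allowed nodes */
                if (!has_intersects_mems_allowed(p, nodemask))
                        continue;
                /* ... compute the badness() score and remember the worst task ... */
        }
        read_unlock(&tasklist_lock);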
I agree that we can add a #ifndef CONFIG_NUMA variant and I'll do so,
thanks (a rough sketch of what I have in mind is at the end of this mail).

> >
> > ...
> >
> > @@ -676,24 +699,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >           */
> >          constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> >          read_lock(&tasklist_lock);
> > -
> > -        switch (constraint) {
> > -        case CONSTRAINT_MEMORY_POLICY:
> > -                oom_kill_process(current, gfp_mask, order, 0, NULL,
> > -                                "No available memory (MPOL_BIND)");
> > -                break;
> > -
> > -        case CONSTRAINT_NONE:
> > -                if (sysctl_panic_on_oom) {
> > +        if (unlikely(sysctl_panic_on_oom)) {
> > +                /*
> > +                 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> > +                 * should not panic for cpuset or mempolicy induced memory
> > +                 * failures.
> > +                 */

> This wasn't changelogged?

It's not a functional change: sysctl_panic_on_oom == 2 is already handled
earlier in the function.  The comment was intended to elaborate on why
we're only concerned about CONSTRAINT_NONE here now that the switch
statement is removed.

> > +                if (constraint == CONSTRAINT_NONE) {
> >                          dump_header(NULL, gfp_mask, order, NULL);
> > -                        panic("out of memory. panic_on_oom is selected\n");
> > +                        read_unlock(&tasklist_lock);
> > +                        panic("Out of memory: panic_on_oom is enabled\n");
> >                  }
> > -                /* Fall-through */
> > -        case CONSTRAINT_CPUSET:
> > -                __out_of_memory(gfp_mask, order);
> > -                break;
> >          }
> > -
> > +        __out_of_memory(gfp_mask, order, constraint, nodemask);
> >          read_unlock(&tasklist_lock);
> >
> >          /*
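As mentioned above, here is a rough sketch of the !CONFIG_NUMA variant
(illustrative only; the actual follow-up may differ in detail).  The NUMA
version from the patch would be wrapped in the matching #ifdef, and without
NUMA there is only one node, so every task trivially intersects and the
thread walk can be compiled out entirely:

#ifndef CONFIG_NUMA
static bool has_intersects_mems_allowed(struct task_struct *tsk,
                                        const nodemask_t *mask)
{
        /* With a single node, every task's memory trivially overlaps ours. */
        return true;
}
#endif /* !CONFIG_NUMA */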