JFYI: I've found some more issues while hammering on this some more. Please
ignore this and the follow-up patch for now. If others are OK with the
cleanups preceding this patch I will repost them with the changes based on
the feedback so far and let them merge into the mm tree before I settle on
this much trickier part.

On Wed 08-07-15 14:27:51, Michal Hocko wrote:
> From: Michal Hocko <mhocko@xxxxxxx>
>
> mm_struct::owner keeps track of the task which is in charge of the
> specific mm. This is usually the thread group leader of the process but
> there are exotic cases where this doesn't hold.
>
> The most prominent one is when separate tasks (not in the same thread
> group) share the address space (by using clone with CLONE_VM without
> CLONE_THREAD). The first task will be the owner until it exits.
> mm_update_next_owner will then try to find a new owner - a task which
> points to the same mm_struct. There is no guarantee the new owner will
> be a thread group leader though, because the leader of that thread
> group might have exited. Even though such a thread will still be around
> waiting for the remaining threads from its group, its mm will be NULL
> so it cannot be chosen.
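Side note for anyone who hasn't met this sharing mode in the wild: it is
easy to reproduce from userspace. The sketch below is mine and not part of
the patch; it only illustrates clone(CLONE_VM) without CLONE_THREAD, where
the child is a separate thread group leader that shares the parent's mm:

/*
 * Illustration only (not from the patch): two processes sharing one
 * address space via clone(CLONE_VM) without CLONE_THREAD.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_counter;      /* lives in the shared address space */

static int child_fn(void *arg)
{
        shared_counter = 42;    /* visible to the parent: same mm */
        printf("child pid %d is its own thread group leader\n", getpid());
        return 0;
}

int main(void)
{
        const size_t stack_size = 1024 * 1024;
        char *stack = malloc(stack_size);
        pid_t pid;

        if (!stack)
                return 1;

        /* CLONE_VM but no CLONE_THREAD: a new thread group, same mm. */
        pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
        if (pid < 0) {
                perror("clone");
                return 1;
        }
        waitpid(pid, NULL, 0);
        printf("parent %d sees shared_counter = %d\n", getpid(), shared_counter);
        free(stack);
        return 0;
}

Once the first of the two exits, mm_update_next_owner() has to go hunting
for another task mapping the same mm; with mm->memcg the association simply
stays with the mm itself.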
>
> cgroup migration code, however, assumes only group leaders when
> migrating via cgroup.procs (which will be the only mode in the unified
> hierarchy API) while mem_cgroup_can_attach considers only those tasks
> which are owners of the mm. So we might end up with tasks which cannot
> be migrated. mm_update_next_owner could be tweaked to try harder and
> use a group leader whenever possible but this will never be 100%
> reliable because all the leaders might be dead. Getting rid of
> mm->owner seems like a better and less hacky option.
>
> The whole concept of the mm owner is a bit artificial and too tricky to
> get right. All the memcg code needs is to find the struct mem_cgroup
> for a given mm_struct and there are only two events when the
> association is either built or changed:
>   - a new mm is created - dup_mm resp. exec_mmap - when the memcg is
>     inherited from the oldmm
>   - the task associated with the mm is moved to another memcg
> So it is much easier to bind the mm_struct to the mem_cgroup directly
> rather than indirectly via a task. This is exactly what this patch does.
>
> mm_inherit_memcg and mm_drop_memcg are exported for the core kernel to
> bind the old memcg during dup_mm (fork) resp. exec_mmap (exec) and to
> release that memcg in mmput after the last reference is dropped and no
> task sees the mm anymore. We have to be careful and take a reference to
> the memcg->css so that it doesn't vanish from under our feet.
>
> The only remaining part is to catch task migration and change the
> association. This is done in mem_cgroup_move_task before charges get
> moved, because mem_cgroup_can_attach is too early: other controllers
> might still fail and we would have to handle the rollback.
>
> mm->memcg conforms to the standard mem_cgroup locking rules. It has to
> be used inside rcu_read_{un}lock() and a reference has to be taken
> before the unlock if the memcg is supposed to be used outside.
>
> Finally, mem_cgroup_can_attach will allow task migration only for
> thread group leaders to conform with cgroup core requirements.
>
> Please note that this patch introduces a USER VISIBLE CHANGE OF
> BEHAVIOR. Without mm->owner, _all_ tasks (group leaders to be precise)
> associated with the mm_struct will initiate memcg migration, while
> previously only the owner of the mm_struct could do that. The original
> behavior was awkward though, because the user task didn't have any
> means to find out the current owner (esp. after mm_update_next_owner),
> so the migration behavior was not well defined in general.
>
> The new cgroup API (unified hierarchy) will discontinue the tasks
> cgroup file, which means that migrating threads will no longer be
> possible. In such a case having CLONE_VM without CLONE_THREAD could
> emulate the thread behavior, but this patch prevents isolating the
> memcg controller from the others. Nevertheless I am not convinced such
> a use case really deserves complications on the memcg code side.
>
> Suggested-by: Oleg Nesterov <oleg@xxxxxxxxxx>
> Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
> ---
>  fs/exec.c                  |   2 +-
>  include/linux/memcontrol.h |  58 ++++++++++++++++++++++++--
>  include/linux/mm_types.h   |  12 +----
>  kernel/exit.c              |  89 ---------------------------------------
>  kernel/fork.c              |  10 +---
>  mm/debug.c                 |   4 +-
>  mm/memcontrol.c            | 101 ++++++++++++++++++++++++++++-----------------
>  7 files changed, 123 insertions(+), 153 deletions(-)
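To make the user visible change a bit more concrete: the migration that
ends up in mem_cgroup_can_attach/mem_cgroup_move_task below is a write of a
thread group leader's pid into the target group's cgroup.procs. A minimal
sketch of that operation (mine, not part of the patch; the mount point
/sys/fs/cgroup/memory and the group name "B" are just assumptions about a
v1 setup):

/* Illustration only: move the calling process into memory cgroup "B". */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        FILE *f = fopen("/sys/fs/cgroup/memory/B/cgroup.procs", "w");

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* cgroup.procs takes a thread group leader pid; the whole group moves. */
        fprintf(f, "%d\n", getpid());
        if (fclose(f)) {
                perror("fclose");
                return 1;
        }
        return 0;
}

With this patch such a write rebinds mm->memcg in mem_cgroup_move_task (and
moves the charges when move_charge_at_immigrate is enabled) instead of
depending on the writer being mm->owner.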
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 1977c2a553ac..3ed9c0abc9f5 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -870,7 +870,7 @@ static int exec_mmap(struct mm_struct *mm)
>  		up_read(&old_mm->mmap_sem);
>  		BUG_ON(active_mm != old_mm);
>  		setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
> -		mm_update_next_owner(old_mm);
> +		mm_inherit_memcg(mm, old_mm);
>  		mmput(old_mm);
>  		return 0;
>  	}
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 78e9d4ac57a1..8e6b2444ebfe 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -274,6 +274,52 @@ struct mem_cgroup {
>  extern struct cgroup_subsys_state *mem_cgroup_root_css;
>  
>  /**
> + * __mm_set_memcg - Set mm_struct::memcg to a given memcg.
> + * @mm: mm struct
> + * @memcg: mem_cgroup to be used
> + *
> + * Note that this function doesn't clean up the previous mm->memcg.
> + * This should be done by the caller when necessary (e.g. when moving
> + * the mm from one memcg to another).
> + */
> +static inline
> +void __mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
> +{
> +	if (memcg)
> +		css_get(&memcg->css);
> +	rcu_assign_pointer(mm->memcg, memcg);
> +}
> +
> +/**
> + * mm_inherit_memcg - Initialize mm_struct::memcg from an existing mm_struct
> + * @newmm: new mm struct
> + * @oldmm: old mm struct to inherit from
> + *
> + * Should be called for each new mm_struct.
> + */
> +static inline
> +void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
> +{
> +	struct mem_cgroup *memcg = oldmm->memcg;
> +
> +	__mm_set_memcg(newmm, memcg);
> +}
> +
> +/**
> + * mm_drop_memcg - drop the mm_struct::memcg association
> + * @mm: mm struct
> + *
> + * Should be called after the mm has been removed from all tasks
> + * and before it is freed (e.g. from mmput).
> + */
> +static inline void mm_drop_memcg(struct mm_struct *mm)
> +{
> +	if (mm->memcg)
> +		css_put(&mm->memcg->css);
> +	mm->memcg = NULL;
> +}
> +
> +/**
>   * mem_cgroup_events - count memory events against a cgroup
>   * @memcg: the memory cgroup
>   * @idx: the event index
> @@ -305,7 +351,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>  bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>  
>  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>  
>  struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
>  static inline
> @@ -335,7 +380,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
>  	bool match = false;
>  
>  	rcu_read_lock();
> -	task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> +	task_memcg = rcu_dereference(mm->memcg);
>  	if (task_memcg)
>  		match = mem_cgroup_is_descendant(task_memcg, memcg);
>  	rcu_read_unlock();
> @@ -474,7 +519,7 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
>  		return;
>  
>  	rcu_read_lock();
> -	memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> +	memcg = rcu_dereference(mm->memcg);
>  	if (unlikely(!memcg))
>  		goto out;
>  
> @@ -498,6 +543,13 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>  
> +static inline void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
> +{
> +}
> +static inline void mm_drop_memcg(struct mm_struct *mm)
> +{
> +}
> +
>  static inline void mem_cgroup_events(struct mem_cgroup *memcg,
>  				     enum mem_cgroup_events_index idx,
>  				     unsigned int nr)
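For reference, the access pattern the changelog describes (lookup under
rcu_read_lock(), pin the css before leaving the RCU section if the memcg is
used outside of it) would look roughly like the sketch below. This is only
my condensed illustration of what get_mem_cgroup_from_mm() ends up doing
with the new field further down in the patch, written as if it sat in
mm/memcontrol.c where root_mem_cgroup is visible; it is not something the
patch adds:

#include <linux/memcontrol.h>
#include <linux/mm_types.h>
#include <linux/rcupdate.h>

static struct mem_cgroup *mm_memcg_pin_sketch(struct mm_struct *mm)
{
        struct mem_cgroup *memcg;

        rcu_read_lock();
        do {
                memcg = rcu_dereference(mm->memcg);
                if (unlikely(!memcg))
                        memcg = root_mem_cgroup;
                /* retry if the group is going away concurrently */
        } while (!css_tryget_online(&memcg->css));
        rcu_read_unlock();

        return memcg;   /* the caller drops the pin with css_put() */
}

The pin the mm itself holds (taken in __mm_set_memcg above) is dropped by
mm_drop_memcg() once the last user goes through mmput().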
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index f6266742ce1f..93dc8cb9c636 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -426,17 +426,7 @@ struct mm_struct {
>  	struct kioctx_table __rcu	*ioctx_table;
>  #endif
>  #ifdef CONFIG_MEMCG
> -	/*
> -	 * "owner" points to a task that is regarded as the canonical
> -	 * user/owner of this mm. All of the following must be true in
> -	 * order for it to be changed:
> -	 *
> -	 * current == mm->owner
> -	 * current->mm != mm
> -	 * new_owner->mm == mm
> -	 * new_owner->alloc_lock is held
> -	 */
> -	struct task_struct __rcu *owner;
> +	struct mem_cgroup __rcu *memcg;
>  #endif
>  
>  	/* store ref to file /proc/<pid>/exe symlink points to */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 185752a729f6..339554612677 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -292,94 +292,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
>  	}
>  }
>  
> -#ifdef CONFIG_MEMCG
> -/*
> - * A task is exiting. If it owned this mm, find a new owner for the mm.
> - */
> -void mm_update_next_owner(struct mm_struct *mm)
> -{
> -	struct task_struct *c, *g, *p = current;
> -
> -retry:
> -	/*
> -	 * If the exiting or execing task is not the owner, it's
> -	 * someone else's problem.
> -	 */
> -	if (mm->owner != p)
> -		return;
> -	/*
> -	 * The current owner is exiting/execing and there are no other
> -	 * candidates. Do not leave the mm pointing to a possibly
> -	 * freed task structure.
> -	 */
> -	if (atomic_read(&mm->mm_users) <= 1) {
> -		mm->owner = NULL;
> -		return;
> -	}
> -
> -	read_lock(&tasklist_lock);
> -	/*
> -	 * Search in the children
> -	 */
> -	list_for_each_entry(c, &p->children, sibling) {
> -		if (c->mm == mm)
> -			goto assign_new_owner;
> -	}
> -
> -	/*
> -	 * Search in the siblings
> -	 */
> -	list_for_each_entry(c, &p->real_parent->children, sibling) {
> -		if (c->mm == mm)
> -			goto assign_new_owner;
> -	}
> -
> -	/*
> -	 * Search through everything else, we should not get here often.
> -	 */
> -	for_each_process(g) {
> -		if (g->flags & PF_KTHREAD)
> -			continue;
> -		for_each_thread(g, c) {
> -			if (c->mm == mm)
> -				goto assign_new_owner;
> -			if (c->mm)
> -				break;
> -		}
> -	}
> -	read_unlock(&tasklist_lock);
> -	/*
> -	 * We found no owner yet mm_users > 1: this implies that we are
> -	 * most likely racing with swapoff (try_to_unuse()) or /proc or
> -	 * ptrace or page migration (get_task_mm()). Mark owner as NULL.
> -	 */
> -	mm->owner = NULL;
> -	return;
> -
> -assign_new_owner:
> -	BUG_ON(c == p);
> -	get_task_struct(c);
> -	/*
> -	 * The task_lock protects c->mm from changing.
> -	 * We always want mm->owner->mm == mm
> -	 */
> -	task_lock(c);
> -	/*
> -	 * Delay read_unlock() till we have the task_lock()
> -	 * to ensure that c does not slip away underneath us
> -	 */
> -	read_unlock(&tasklist_lock);
> -	if (c->mm != mm) {
> -		task_unlock(c);
> -		put_task_struct(c);
> -		goto retry;
> -	}
> -	mm->owner = c;
> -	task_unlock(c);
> -	put_task_struct(c);
> -}
> -#endif /* CONFIG_MEMCG */
> -
>  /*
>   * Turn us into a lazy TLB process if we
>   * aren't already..
> @@ -433,7 +345,6 @@ static void exit_mm(struct task_struct *tsk)
>  	up_read(&mm->mmap_sem);
>  	enter_lazy_tlb(mm, current);
>  	task_unlock(tsk);
> -	mm_update_next_owner(mm);
>  	mmput(mm);
>  	if (test_thread_flag(TIF_MEMDIE))
>  		exit_oom_victim();
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 16e0f872f084..d073b6249d98 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -570,13 +570,6 @@ static void mm_init_aio(struct mm_struct *mm)
>  #endif
>  }
>  
> -static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
> -{
> -#ifdef CONFIG_MEMCG
> -	mm->owner = p;
> -#endif
> -}
> -
>  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>  {
>  	mm->mmap = NULL;
> @@ -596,7 +589,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>  	spin_lock_init(&mm->page_table_lock);
>  	mm_init_cpumask(mm);
>  	mm_init_aio(mm);
> -	mm_init_owner(mm, p);
>  	mmu_notifier_mm_init(mm);
>  	clear_tlb_flush_pending(mm);
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> @@ -702,6 +694,7 @@ void mmput(struct mm_struct *mm)
>  	}
>  	if (mm->binfmt)
>  		module_put(mm->binfmt->module);
> +	mm_drop_memcg(mm);
>  	mmdrop(mm);
>  	}
>  }
> @@ -926,6 +919,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
>  	if (mm->binfmt && !try_module_get(mm->binfmt->module))
>  		goto free_pt;
>  
> +	mm_inherit_memcg(mm, oldmm);
>  	return mm;
>  
>  free_pt:
> diff --git a/mm/debug.c b/mm/debug.c
> index 3eb3ac2fcee7..d0347a168651 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -184,7 +184,7 @@ void dump_mm(const struct mm_struct *mm)
>  		"ioctx_table %p\n"
>  #endif
>  #ifdef CONFIG_MEMCG
> -		"owner %p "
> +		"memcg %p "
>  #endif
>  		"exe_file %p\n"
>  #ifdef CONFIG_MMU_NOTIFIER
> @@ -218,7 +218,7 @@ void dump_mm(const struct mm_struct *mm)
>  		mm->ioctx_table,
>  #endif
>  #ifdef CONFIG_MEMCG
> -		mm->owner,
> +		mm->memcg,
>  #endif
>  		mm->exe_file,
>  #ifdef CONFIG_MMU_NOTIFIER
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 19ffae804076..4069ec8f52be 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -294,6 +294,18 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
>  	return mem_cgroup_from_css(css);
>  }
>  
> +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> +{
> +	if (p->mm)
> +		return rcu_dereference(p->mm->memcg);
> +
> +	/*
> +	 * If the process doesn't have mm struct anymore we have to fallback
> +	 * to the task_css.
> +	 */
> +	return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> +}
> +
>  /* Writing them here to avoid exposing memcg's inner layout */
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>  
> @@ -783,19 +795,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
>  	}
>  }
>  
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> -{
> -	/*
> -	 * mm_update_next_owner() may clear mm->owner to NULL
> -	 * if it races with swapoff, page migration, etc.
> -	 * So this can be called with p == NULL.
> -	 */
> -	if (unlikely(!p))
> -		return NULL;
> -
> -	return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> -}
> -
>  static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
>  {
>  	struct mem_cgroup *memcg = NULL;
> @@ -810,7 +809,7 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
>  		if (unlikely(!mm))
>  			memcg = root_mem_cgroup;
>  		else {
> -			memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> +			memcg = rcu_dereference(mm->memcg);
>  			if (unlikely(!memcg))
>  				memcg = root_mem_cgroup;
>  		}
> @@ -2286,7 +2285,7 @@ void __memcg_kmem_put_cache(struct kmem_cache *cachep)
>  }
>  
>  /*
> - * We need to verify if the allocation against current->mm->owner's memcg is
> + * We need to verify if the allocation against current->mm->memcg is
>   * possible for the given order. But the page is not allocated yet, so we'll
>   * need a further commit step to do the final arrangements.
>   *
> @@ -4737,7 +4736,7 @@ static void mem_cgroup_clear_mc(void)
>  static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
>  				 struct cgroup_taskset *tset)
>  {
> -	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	struct mem_cgroup *to = mem_cgroup_from_css(css);
>  	struct mem_cgroup *from;
>  	struct task_struct *p;
>  	struct mm_struct *mm;
> @@ -4749,37 +4748,49 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
>  	 * tunable will only affect upcoming migrations, not the current one.
>  	 * So we need to save it, and keep it going.
>  	 */
> -	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
> +	move_flags = READ_ONCE(to->move_charge_at_immigrate);
>  	if (!move_flags)
>  		return 0;
>  
>  	p = cgroup_taskset_first(tset);
> -	from = mem_cgroup_from_task(p);
> -
> -	VM_BUG_ON(from == memcg);
> +	if (!thread_group_leader(p))
> +		return 0;
>  
>  	mm = get_task_mm(p);
>  	if (!mm)
>  		return 0;
> -	/* We move charges only when we move a owner of the mm */
> -	if (mm->owner == p) {
> -		VM_BUG_ON(mc.from);
> -		VM_BUG_ON(mc.to);
> -		VM_BUG_ON(mc.precharge);
> -		VM_BUG_ON(mc.moved_charge);
> -		VM_BUG_ON(mc.moved_swap);
> -
> -		spin_lock(&mc.lock);
> -		mc.from = from;
> -		mc.to = memcg;
> -		mc.flags = move_flags;
> -		spin_unlock(&mc.lock);
> -		/* We set mc.moving_task later */
> -
> -		ret = mem_cgroup_precharge_mc(mm);
> -		if (ret)
> -			mem_cgroup_clear_mc();
> -	}
> +
> +	/*
> +	 * tasks' cgroup might be different from the one p->mm is associated
> +	 * with because CLONE_VM is allowed without CLONE_THREAD. The task is
> +	 * moving so we have to migrate from the memcg associated with its
> +	 * address space.
> +	 * No need to take a reference here because the memcg is pinned by the
> +	 * mm_struct.
> +	 */
> +	from = READ_ONCE(mm->memcg);
> +	if (!from)
> +		from = root_mem_cgroup;
> +	if (from == to)
> +		goto out;
> +
> +	VM_BUG_ON(mc.from);
> +	VM_BUG_ON(mc.to);
> +	VM_BUG_ON(mc.precharge);
> +	VM_BUG_ON(mc.moved_charge);
> +	VM_BUG_ON(mc.moved_swap);
> +
> +	spin_lock(&mc.lock);
> +	mc.from = from;
> +	mc.to = to;
> +	mc.flags = move_flags;
> +	spin_unlock(&mc.lock);
> +	/* We set mc.moving_task later */
> +
> +	ret = mem_cgroup_precharge_mc(mm);
> +	if (ret)
> +		mem_cgroup_clear_mc();
> +out:
>  	mmput(mm);
>  	return ret;
>  }
> @@ -4932,14 +4943,26 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
>  {
>  	struct task_struct *p = cgroup_taskset_first(tset);
>  	struct mm_struct *mm = get_task_mm(p);
> +	struct mem_cgroup *old_memcg = NULL;
>  
>  	if (mm) {
> +		old_memcg = READ_ONCE(mm->memcg);
> +		__mm_set_memcg(mm, mem_cgroup_from_css(css));
> +
>  		if (mc.to)
>  			mem_cgroup_move_charge(mm);
>  		mmput(mm);
>  	}
>  	if (mc.to)
>  		mem_cgroup_clear_mc();
> +
> +	/*
> +	 * Be careful and drop the reference only after we are done because
> +	 * p's task_css memcg might be different from p->memcg and nothing else
> +	 * might be pinning the old memcg.
> +	 */
> +	if (old_memcg)
> +		css_put(&old_memcg->css);
>  }
>  #else /* !CONFIG_MMU */
>  static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
> -- 
> 2.1.4

-- 
Michal Hocko
SUSE Labs