On Tue, 8 Jun 2010 16:31:29 +0900 Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx> wrote:
> On Tue, 8 Jun 2010 12:19:01 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> > Now, I think pre_destroy->force_empty() works very well and we can get rid of
> > css_put/get per page. This has a very big effect in some special cases.
> >
> > This is a test result with a multi-thread page fault program
> > (the one I used in the rwsem discussion.)
> >
> > [Before patch]
> >     25.72%  multi-fault-all  [kernel.kallsyms]  [k] clear_page_c
> >      8.18%  multi-fault-all  [kernel.kallsyms]  [k] try_get_mem_cgroup_from_mm
> >      8.17%  multi-fault-all  [kernel.kallsyms]  [k] down_read_trylock
> >      8.03%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >      5.46%  multi-fault-all  [kernel.kallsyms]  [k] __css_put
> >      5.45%  multi-fault-all  [kernel.kallsyms]  [k] __alloc_pages_nodemask
> >      4.36%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock_irq
> >      4.35%  multi-fault-all  [kernel.kallsyms]  [k] up_read
> >      3.59%  multi-fault-all  [kernel.kallsyms]  [k] css_put
> >      2.37%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock
> >      1.80%  multi-fault-all  [kernel.kallsyms]  [k] mem_cgroup_add_lru_list
> >      1.78%  multi-fault-all  [kernel.kallsyms]  [k] __rmqueue
> >      1.65%  multi-fault-all  [kernel.kallsyms]  [k] handle_mm_fault
> >
> > try_get_mem_cgroup_from_mm() is one of the heavy ops because of false sharing
> > in css's counter for css_get/put.
> >
> I'm sorry, what do you mean by "false-sharing" ?

Cacheline ping-pong among cpus around css's counter.

> And I think it would be better to add these performance data to the commit log.
>
Yes, I'll move this into the changelog when I remove the RFC tag.

> > I removed that.
> >
> > [After]
> >     26.16%  multi-fault-all  [kernel.kallsyms]  [k] clear_page_c
> >     11.73%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock
> >      9.23%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >      9.07%  multi-fault-all  [kernel.kallsyms]  [k] down_read_trylock
> >      6.09%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock_irq
> >      5.57%  multi-fault-all  [kernel.kallsyms]  [k] __alloc_pages_nodemask
> >      4.86%  multi-fault-all  [kernel.kallsyms]  [k] up_read
> >      2.54%  multi-fault-all  [kernel.kallsyms]  [k] __mem_cgroup_commit_charge
> >      2.29%  multi-fault-all  [kernel.kallsyms]  [k] _cond_resched
> >      2.04%  multi-fault-all  [kernel.kallsyms]  [k] mem_cgroup_add_lru_list
> >      1.82%  multi-fault-all  [kernel.kallsyms]  [k] handle_mm_fault
> >
> > Hmm, seems nice. But I'm not yet convinced that my patch has no race.
> > I'll continue testing, but your help is welcome.
> >
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> >
> > Now, memory cgroup increments the css (cgroup subsys state) reference
> > count per charged page, and the reference count is kept until the page
> > is uncharged. But this has 2 bad effects.
> >
> > 1. Because css_get/put calls atomic_inc()/dec(), heavy use of them
> >    on large SMP machines will not scale well.
> I'm sorry if I'm asking a stupid question, but the number of css_get/put
> calls would be:
>
>   before:
>     get: 1 in charge
>     put: 1 in uncharge
>   after:
>     get: 1, put: 1 in charge
>     no get/put in uncharge
>
> right ?

No.

  before:
    get 1 in charge, put 1 in uncharge.
  after:
    no get at charge in the fast path (when consume_stock() hits.)
    get 1 at accessing res_counter and reclaim, put 1 after it.
    no get/put in uncharge.

> Then, isn't there any change as a whole ?
>
We get much benefit when consume_stock() works.

> > 2. Because css's refcnt cannot be put into a "ready-to-release" state,
> >    cgroup's notify_on_release handler can't work with memcg.
> >
> Yes, 2 is one of the weak points of memcg, IMHO.

> > This is a trial to remove css's refcnt per page.
> > Even if we remove the
> > refcnt, pre_destroy() does enough synchronization.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> > ---
> >  mm/memcontrol.c |   66 +++++++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 46 insertions(+), 20 deletions(-)
> >
> > Index: mmotm-2.6.34-Jun6/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.34-Jun6.orig/mm/memcontrol.c
> > +++ mmotm-2.6.34-Jun6/mm/memcontrol.c
> > @@ -1717,25 +1717,49 @@ static int __mem_cgroup_try_charge(struc
> >  	 * thread group leader migrates. It's possible that mm is not
> >  	 * set, if so charge the init_mm (happens for pagecache usage).
> >  	 */
> > -	if (*memcg) {
> > +	if (!*memcg && !mm)
> > +		goto bypass;
> Shouldn't it be VM_BUG_ON(!*memcg && !mm) ?
> IIUC, we were afraid of a thread with no mm coming here... at boot, etc.

I don't want to change that minor behavior in this set.

> > +again:
> > +	if (*memcg) { /* css should be a valid one */
> >  		mem = *memcg;
> > +		VM_BUG_ON(css_is_removed(mem));
> > +		if (mem_cgroup_is_root(mem))
> > +			goto done;
> > +		if (consume_stock(mem))
> > +			goto done;
> >  		css_get(&mem->css);
> >  	} else {
> > -		mem = try_get_mem_cgroup_from_mm(mm);
> > -		if (unlikely(!mem))
> > -			return 0;
> > -		*memcg = mem;
> > -	}
> > +		struct task_struct *p;
> >
> > -	VM_BUG_ON(css_is_removed(&mem->css));
> > -	if (mem_cgroup_is_root(mem))
> > -		goto done;
> > +		rcu_read_lock();
> > +		p = rcu_dereference(mm->owner);
> > +		VM_BUG_ON(!p);
> > +		/*
> > +		 * while task_lock, this task cannot be disconnected with
> > +		 * the cgroup we see.
> > +		 */
> > +		task_lock(p);
> > +		mem = mem_cgroup_from_task(p);
> > +		VM_BUG_ON(!mem);
> > +		if (mem_cgroup_is_root(mem)) {
> Shouldn't we do "*memcg = mem" here ?
> Hmm, how about doing:
>
>   done:
>   	*memcg = mem;
>   	return 0;
>
> instead of doing "*memcg = mem" in some places ?

OK, I'll consider that.
> > +			task_unlock(p);
> > +			rcu_read_unlock();
> > +			goto done;
> > +		}
> > +		if (consume_stock(mem)) {
> > +			*memcg = mem;
> > +			task_unlock(p);
> > +			rcu_read_unlock();
> > +			goto done;
> > +		}
> > +		css_get(&mem->css);
> > +		task_unlock(p);
> > +		rcu_read_unlock();
> > +	}
> >
> >  	do {
> >  		bool oom_check;
> >
> > -		if (consume_stock(mem))
> > -			goto done; /* don't need to fill stock */
> >  		/* If killed, bypass charge */
> >  		if (fatal_signal_pending(current))
> >  			goto bypass;
> > @@ -1750,10 +1774,13 @@ static int __mem_cgroup_try_charge(struc
> >
> >  		switch (ret) {
> >  		case CHARGE_OK:
> > +			*memcg = mem;
> >  			break;
> >  		case CHARGE_RETRY: /* not in OOM situation but retry */
> >  			csize = PAGE_SIZE;
> > -			break;
> > +			css_put(&mem->css);
> > +			mem = NULL;
> > +			goto again;
> >  		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
> >  			goto nomem;
> >  		case CHARGE_NOMEM: /* OOM routine works */
> > @@ -1769,6 +1796,7 @@ static int __mem_cgroup_try_charge(struc
> >
> >  	if (csize > PAGE_SIZE)
> >  		refill_stock(mem, csize - PAGE_SIZE);
> > +	css_put(&mem->css);
> >  done:
> >  	return 0;
> >  nomem:
> > @@ -1795,7 +1823,6 @@ static void __mem_cgroup_cancel_charge(s
> >  		res_counter_uncharge(&mem->memsw, PAGE_SIZE * count);
> >  		VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
> >  		WARN_ON_ONCE(count > INT_MAX);
> > -		__css_put(&mem->css, (int)count);
> >  	}
> >  	/* we don't need css_put for root */
> >  }
> These VM_BUG_ON() and WARN_ON_ONCE() will be unnecessary, too.

OK.
> > @@ -2158,7 +2185,6 @@ int mem_cgroup_try_charge_swapin(struct
> >  		goto charge_cur_mm;
> >  	*ptr = mem;
> >  	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
> > -	/* drop extra refcnt from tryget */
> >  	css_put(&mem->css);
> >  	return ret;
> >  charge_cur_mm:
> > @@ -2345,9 +2371,6 @@ __mem_cgroup_uncharge_common(struct page
> >  	unlock_page_cgroup(pc);
> >
> >  	memcg_check_events(mem, page);
> > -	/* at swapout, this memcg will be accessed to record to swap */
> > -	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > -		css_put(&mem->css);
> >
> >  	return mem;
> >
> > @@ -2432,14 +2455,18 @@ mem_cgroup_uncharge_swapcache(struct pag
> >  	if (!swapout) /* this was a swap cache but the swap is unused ! */
> >  		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
> >
> > -	memcg = __mem_cgroup_uncharge_common(page, ctype);
> > +	memcg = try_get_mem_cgroup_from_page(page);
> > +	if (!memcg)
> > +		return;
> > +
> > +	__mem_cgroup_uncharge_common(page, ctype);
> >
> >  	/* record memcg information */
> >  	if (do_swap_account && swapout && memcg) {
> >  		swap_cgroup_record(ent, css_id(&memcg->css));
> >  		mem_cgroup_get(memcg);
> >  	}
> > -	if (swapout && memcg)
> > +	if (memcg)
> >  		css_put(&memcg->css);
> >  }
> >  #endif
> "if (memcg)" is unnecessary (it's checked above).

Sure.

> > @@ -4219,7 +4246,6 @@ static int mem_cgroup_do_precharge(unsig
> >  		mc.precharge += count;
> >  		VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
> >  		WARN_ON_ONCE(count > INT_MAX);
> > -		__css_get(&mem->css, (int)count);
> >  		return ret;
> >  	}
> > one_by_one:
> >
> ditto.

OK.

> IIUC this patch, we should remove css_put() in mem_cgroup_move_swap_account()
> and __css_put() in mem_cgroup_clear_mc() too, and modify some comments.
> Anyway, we must test these changes carefully.

Will do in the next version.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .