Still RFC, added lkml to CC: list.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>

Now, memory cgroup increments css (cgroup subsys state)'s reference
count per charged page. The reference count is kept until the page is
uncharged. But this has 2 bad effects:

 1. Because css_get()/css_put() call atomic_inc()/atomic_dec(), heavy
    use of them on a large SMP system will not scale well.
 2. Because the css's refcnt can never be in a "ready-to-release"
    state, cgroup's notify_on_release handler can't work with memcg.

This is a trial to remove the css refcnt per charged page. Even if we
remove the refcnt, pre_destroy() does enough synchronization.

After this patch, css_get() is still called in try_charge(), but the
logic is:

 1. Look up the memcg of mm->owner under rcu_read_lock().
 2. If that memcg is the cached one, consume_stock() will work. If it
    returns true, rcu_read_unlock() and return immediately.
 3. If the memcg doesn't hit the cached one, css_tryget() it, then
    access the res_counter and do the charge.

So, in the fast path, we don't call css_get() and can avoid touching
the shared counter. The point is that while a memcg has charges in
use, it can't be freed; so, on a stock hit, we don't have to take
css_get(). (A minimal user-space sketch of this idea follows the test
results below.)

Brief test result on mmotm-2.6.32-Jul-06 + Nishimura-san's cleanup:
a test with a multi-threaded page-fault program on an 8-cpu system.
(Each thread touches 2M of pages and releases them by madvise().)

[Before]
 39522457 page-faults
 549387860 cache-misses
 (cache-miss/page-fault 13.90)
 60.003715887 seconds time elapsed

    25.75%  multi-fault-all  [kernel.kallsyms]  [k] clear_page_c
     8.59%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     8.22%  multi-fault-all  [kernel.kallsyms]  [k] try_get_mem_cgroup_from_mm
     8.01%  multi-fault-all  [kernel.kallsyms]  [k] down_read_trylock
     5.41%  multi-fault-all  [kernel.kallsyms]  [k] __alloc_pages_nodemask
     5.41%  multi-fault-all  [kernel.kallsyms]  [k] __css_put
     4.48%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock_irq
     4.41%  multi-fault-all  [kernel.kallsyms]  [k] up_read
     3.60%  multi-fault-all  [kernel.kallsyms]  [k] css_put
     1.93%  multi-fault-all  [kernel.kallsyms]  [k] handle_mm_fault
     1.91%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock
     1.89%  multi-fault-all  [kernel.kallsyms]  [k] _cond_resched
     1.78%  multi-fault-all  [kernel.kallsyms]  [k] __rmqueue
     1.69%  multi-fault-all  [kernel.kallsyms]  [k] mem_cgroup_add_lru_list
     1.63%  multi-fault-all  [kernel.kallsyms]  [k] page_fault
     1.45%  multi-fault-all  [kernel.kallsyms]  [k] __mem_cgroup_commit_charge
     1.34%  multi-fault-all  [kernel.kallsyms]  [k] find_vma

[After]
 43253862 page-faults
 505750203 cache-misses
 (cache-miss/page-fault 11.69)
 60.004123555 seconds time elapsed

    27.98%  multi-fault-all  [kernel.kallsyms]  [k] clear_page_c
     9.89%  multi-fault-all  [kernel.kallsyms]  [k] down_read_trylock
     9.88%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock_irq
     9.37%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     5.91%  multi-fault-all  [kernel.kallsyms]  [k] __alloc_pages_nodemask
     5.69%  multi-fault-all  [kernel.kallsyms]  [k] up_read
     2.94%  multi-fault-all  [kernel.kallsyms]  [k] _raw_spin_lock
     2.70%  multi-fault-all  [kernel.kallsyms]  [k] handle_mm_fault
     2.66%  multi-fault-all  [kernel.kallsyms]  [k] mem_cgroup_add_lru_list
     2.25%  multi-fault-all  [kernel.kallsyms]  [k] __mem_cgroup_commit_charge
     1.90%  multi-fault-all  [kernel.kallsyms]  [k] __rmqueue
     1.74%  multi-fault-all  [kernel.kallsyms]  [k] page_fault
     1.52%  multi-fault-all  [kernel.kallsyms]  [k] find_vma

So, the refcounting overhead is reduced.

Note: Because this is an extreme, not-realistic test, I'm not sure how
much this change will benefit usual applications.
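To show why a stock hit needs neither css_get() nor the shared
counter, here is a minimal user-space sketch of the idea. This is not
the patch's code: the names (struct memcg, charge_page(), STOCK_PAGES)
are made up for illustration, and the kernel's stock is per-cpu and
irq-safe, which this toy model is not.

/*
 * Toy model of stock-based charging. A batch is charged to the shared
 * counter once; later charges are paid from a cpu-local stock.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE   4096L
#define STOCK_PAGES 32      /* batch size, in the spirit of CHARGE_SIZE */

struct memcg {
        atomic_long charge;   /* the contended shared counter */
        atomic_long css_refs; /* stands in for css_get()/css_put() */
};

/* "Per-cpu" stock, modeled here as thread-local state. */
static _Thread_local struct {
        struct memcg *cached;
        long nr_bytes;
} stock;

/*
 * Fast path: if this cpu already holds pre-charged pages of @mem,
 * consume one locally. No atomics, no refcount. While the stock is
 * non-empty, @mem has live charges, so it cannot pass pre_destroy()
 * and be freed under us.
 */
static bool consume_stock(struct memcg *mem)
{
        if (stock.cached == mem && stock.nr_bytes >= PAGE_SIZE) {
                stock.nr_bytes -= PAGE_SIZE;
                return true;
        }
        return false;
}

/* Slow path: pin the memcg, charge a whole batch, keep the surplus. */
static void charge_page(struct memcg *mem)
{
        if (consume_stock(mem))
                return;                         /* fast path: cpu-local only */

        atomic_fetch_add(&mem->css_refs, 1);    /* css_get() */
        atomic_fetch_add(&mem->charge, STOCK_PAGES * PAGE_SIZE);
        stock.cached = mem;
        stock.nr_bytes = (STOCK_PAGES - 1) * PAGE_SIZE; /* 1 page used now */
        atomic_fetch_sub(&mem->css_refs, 1);    /* css_put() */
}

int main(void)
{
        struct memcg cg = { 0 };

        for (int i = 0; i < 100; i++)
                charge_page(&cg);       /* 96 of 100 calls hit the stock */
        printf("charged %ld bytes, css refcount %ld\n",
               atomic_load(&cg.charge), atomic_load(&cg.css_refs));
        return 0;
}

Built with cc -std=c11, 100 charges with a batch of 32 touch the
shared counter and the refcount only 4 times; the other 96 charges
stay cpu-local, which is where the cache-miss reduction comes from.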
Changelog 20100609:
 - clean up try_charge().
 - fixed mem_cgroup_clear_mc.
 - removed unnecessary warnings.
 - removed task_lock, added more comments.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
---
 mm/memcontrol.c |   96 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 63 insertions(+), 33 deletions(-)

Index: mmotm-2.6.34-Jun6/mm/memcontrol.c
===================================================================
--- mmotm-2.6.34-Jun6.orig/mm/memcontrol.c
+++ mmotm-2.6.34-Jun6/mm/memcontrol.c
@@ -1717,28 +1717,61 @@ static int __mem_cgroup_try_charge(struc
          * thread group leader migrates. It's possible that mm is not
          * set, if so charge the init_mm (happens for pagecache usage).
          */
-        if (*memcg) {
+        if (!*memcg && !mm)
+                goto bypass;
+again:
+        if (*memcg) { /* css should be a valid one */
                 mem = *memcg;
+                VM_BUG_ON(css_is_removed(&mem->css));
+                if (mem_cgroup_is_root(mem))
+                        goto done;
+                if (consume_stock(mem))
+                        goto done;
                 css_get(&mem->css);
         } else {
-                mem = try_get_mem_cgroup_from_mm(mm);
-                if (unlikely(!mem))
-                        return 0;
-                *memcg = mem;
-        }
+                struct task_struct *p;
 
-        VM_BUG_ON(css_is_removed(&mem->css));
-        if (mem_cgroup_is_root(mem))
-                goto done;
+                rcu_read_lock();
+                p = rcu_dereference(mm->owner);
+                VM_BUG_ON(!p);
+                /*
+                 * It seems there are some races when 'p' exits. But, at
+                 * exit(), p->cgroups will be reset to init_css_set. So,
+                 * we'll just find the root cgroup even if 'p' gets to be
+                 * obsolete.
+                 */
+                mem = mem_cgroup_from_task(p);
+                VM_BUG_ON(!mem);
+                if (mem_cgroup_is_root(mem)) {
+                        rcu_read_unlock();
+                        goto done;
+                }
+                if (consume_stock(mem)) {
+                        /*
+                         * It seems dangerous to access memcg without
+                         * css_get(). But considering how consume_stock
+                         * works, it's not necessary. If consume_stock
+                         * succeeds, some charges from this memcg are
+                         * cached on this cpu. So, we don't need to call
+                         * css_get()/css_tryget() before calling
+                         * consume_stock().
+                         */
+                        rcu_read_unlock();
+                        goto done;
+                }
+                if (!css_tryget(&mem->css)) {
+                        rcu_read_unlock();
+                        goto again;
+                }
+                rcu_read_unlock();
+        }
 
         do {
                 bool oom_check;
 
-                if (consume_stock(mem))
-                        goto done; /* don't need to fill stock */
                 /* If killed, bypass charge */
-                if (fatal_signal_pending(current))
+                if (fatal_signal_pending(current)) {
+                        css_put(&mem->css);
                         goto bypass;
+                }
 
                 oom_check = false;
                 if (oom && !nr_oom_retries) {
@@ -1753,30 +1786,36 @@ static int __mem_cgroup_try_charge(struc
                         break;
                 case CHARGE_RETRY: /* not in OOM situation but retry */
                         csize = PAGE_SIZE;
-                        break;
+                        css_put(&mem->css);
+                        mem = NULL;
+                        goto again;
                 case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
+                        css_put(&mem->css);
                         goto nomem;
                 case CHARGE_NOMEM: /* OOM routine works */
-                        if (!oom)
+                        if (!oom) {
+                                css_put(&mem->css);
                                 goto nomem;
+                        }
                         /* If oom, we never return -ENOMEM */
                         nr_oom_retries--;
                         break;
                 case CHARGE_OOM_DIE: /* Killed by OOM Killer */
+                        css_put(&mem->css);
                         goto bypass;
                 }
         } while (ret != CHARGE_OK);
 
         if (csize > PAGE_SIZE)
                 refill_stock(mem, csize - PAGE_SIZE);
+        css_put(&mem->css);
 done:
+        *memcg = mem;
         return 0;
 nomem:
-        css_put(&mem->css);
+        *memcg = NULL;
         return -ENOMEM;
 bypass:
-        if (mem)
-                css_put(&mem->css);
         *memcg = NULL;
         return 0;
 }
@@ -1793,11 +1832,7 @@ static void __mem_cgroup_cancel_charge(s
                 res_counter_uncharge(&mem->res, PAGE_SIZE * count);
                 if (do_swap_account)
                         res_counter_uncharge(&mem->memsw, PAGE_SIZE * count);
-                VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
-                WARN_ON_ONCE(count > INT_MAX);
-                __css_put(&mem->css, (int)count);
         }
-        /* we don't need css_put for root */
 }
 
 static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
@@ -2158,7 +2193,6 @@ int mem_cgroup_try_charge_swapin(struct
                 goto charge_cur_mm;
         *ptr = mem;
         ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
-        /* drop extra refcnt from tryget */
         css_put(&mem->css);
         return ret;
 charge_cur_mm:
@@ -2345,9 +2379,6 @@ __mem_cgroup_uncharge_common(struct page
         unlock_page_cgroup(pc);
 
         memcg_check_events(mem, page);
-        /* at swapout, this memcg will be accessed to record to swap */
-        if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
-                css_put(&mem->css);
 
         return mem;
 
@@ -2432,15 +2463,18 @@ mem_cgroup_uncharge_swapcache(struct pag
         if (!swapout) /* this was a swap cache but the swap is unused ! */
                 ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
 
-        memcg = __mem_cgroup_uncharge_common(page, ctype);
+        memcg = try_get_mem_cgroup_from_page(page);
+        if (!memcg)
+                return;
+
+        __mem_cgroup_uncharge_common(page, ctype);
 
         /* record memcg information */
-        if (do_swap_account && swapout && memcg) {
+        if (do_swap_account && swapout) {
                 swap_cgroup_record(ent, css_id(&memcg->css));
                 mem_cgroup_get(memcg);
         }
-        if (swapout && memcg)
-                css_put(&memcg->css);
+        css_put(&memcg->css);
 }
 #endif
 
@@ -2518,7 +2552,6 @@ static int mem_cgroup_move_swap_account(
                  */
                 if (!mem_cgroup_is_root(to))
                         res_counter_uncharge(&to->res, PAGE_SIZE);
-                css_put(&to->css);
         }
         return 0;
 }
@@ -4219,7 +4252,6 @@ static int mem_cgroup_do_precharge(unsig
                 mc.precharge += count;
                 VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
                 WARN_ON_ONCE(count > INT_MAX);
-                __css_get(&mem->css, (int)count);
                 return ret;
         }
 one_by_one:
@@ -4469,8 +4501,6 @@ static void mem_cgroup_clear_mc(void)
                  */
                 res_counter_uncharge(&mc.to->res,
                         PAGE_SIZE * mc.moved_swap);
-                VM_BUG_ON(test_bit(CSS_ROOT, &mc.to->css.flags));
-                __css_put(&mc.to->css, mc.moved_swap);
         }
         /* we've already done mem_cgroup_get(mc.to) */
-- 