On Fri, 23 Apr 2010 13:17:38 -0700 Greg Thelen <gthelen@xxxxxxxxxx> wrote:

> On Wed, Apr 14, 2010 at 11:54 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> > On Thu, 15 Apr 2010 15:21:04 +0900
> > Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx> wrote:
> >> > The only reason to use trylock in this case is to prevent deadlock
> >> > when running in a context that may have preempted or interrupted a
> >> > routine that already holds the bit locked. In
> >> > __remove_from_page_cache() irqs are disabled, but that does not imply
> >> > that a routine holding the spinlock has been preempted. When the bit
> >> > is locked, preemption is disabled. The only way to interrupt a holder
> >> > of the bit is for an interrupt to occur (I/O, timer, etc). So I think
> >> > that in_interrupt() is sufficient. Am I missing something?
> >> >
> >> IIUC, it would be enough to prevent deadlock where one CPU tries to acquire
> >> the same page cgroup lock. But there is still some possibility where 2 CPUs
> >> can deadlock each other (please see commit e767e056).
> >> IOW, my point is "don't call lock_page_cgroup() under mapping->tree_lock".
> >>
> > Hmm, maybe worth a try. We may be able to set/clear all DIRTY/WRITEBACK bits
> > on page_cgroup without mapping->tree_lock.
> > In such a case, of course, the page itself should be locked by lock_page().
> >
> > But, hmm.. for example:
> >
> > account_page_dirtied() is the best place to mark page_cgroup dirty. But
> > it's called under mapping->tree_lock.
> >
> > Another thought:
> > I wonder whether we may have to change our approach to dirty page accounting.
> >
> > Please see task_dirty_inc(). It's for per-task dirty limiting.
> > And you'll notice soon that there is no task_dirty_dec().
>
> Hello Kame-san,
>
> This is an interesting idea. If this applies to memcg dirty accounting,
> then would it also apply to system-wide dirty accounting? I don't think
> so, but I wanted to float the idea. It looks like the proportions.c
> code is good at comparing the rates of events (for example: per-task
> dirty page events). However, in the case of system-wide dirty
> accounting we also want to consider the amount of dirty memory, not just
> the rate at which it is being dirtied.
>
> Cgroup dirty page accounting imposes the following additional accounting
> complexities:
> * hierarchical accounting
> * page migration between cgroups
>
> For per-memcg dirty accounting, are you thinking that each mem_cgroup
> would have a struct prop_local_single to represent a memcg's dirty
> memory usage relative to a system-wide prop_descriptor?
>
> My concern is that we will still need an efficient way to determine the
> mem_cgroup associated with a page under a variety of conditions (page
> fault path for new mappings, softirq for dirty page writeback).
>
> Currently -rc4 and -mmotm use a non-irq-safe lock_page_cgroup() to
> protect a page's cgroup membership. I think this will cause problems as
> we add more per-cgroup stats (dirty page counts, etc) that are adjusted
> in irq handlers. Proposed approaches include:
> 1. use try-style locking. This can lead to fuzzy counters, which some
>    do not like. Over time these fuzzy counters may drift.
>
> 2. mask irqs when calling lock_page_cgroup() (see the illustrative sketch
>    after this list). This has some performance cost, though it may be
>    small (see below).
>
> 3. because a page's cgroup membership rarely changes, use RCU locking.
>    This is fast, but may be more complex than we want.
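[Illustrative aside, not part of the thread: option 2 above could be packaged as small helpers layered on the existing lock_page_cgroup()/unlock_page_cgroup() bit lock plus local_irq_save()/local_irq_restore(). The helper names are hypothetical; this is only a sketch of the idea, not mmotm code.]

/*
 * Hypothetical irq-safe wrappers for the page_cgroup bit lock (sketch of
 * option 2 above).  Saving and restoring the irq state lets the lock
 * also be taken from irq handlers without risk of deadlocking against a
 * holder on the same CPU.
 */
static inline void lock_page_cgroup_irqsave(struct page_cgroup *pc,
					    unsigned long *flags)
{
	local_irq_save(*flags);		/* block irq handlers that might also lock pc */
	lock_page_cgroup(pc);
}

static inline void unlock_page_cgroup_irqrestore(struct page_cgroup *pc,
						 unsigned long flags)
{
	unlock_page_cgroup(pc);
	local_irq_restore(flags);
}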
>
> The performance of simple irqsave locking or more advanced RCU locking
> is similar to current locking (non-irqsave/non-RCU) for several
> workloads (kernel build, dd). Using a micro-benchmark some differences
> are seen:
> * irqsave is 1% slower than mmotm non-irqsave/non-RCU locking.
> * RCU locking is 4% faster than mmotm non-irqsave/non-RCU locking.
> * RCU locking is 5% faster than irqsave locking.
>
> I think we need some changes to per-memcg dirty page accounting updates
> from irq handlers. If we want to focus on micro-benchmark performance,
> then RCU locking seems like the correct approach. Otherwise, irqsave
> locking seems adequate. I'm thinking that for now we should start
> simple and use irqsave. Comments?
>
> Here's the data I collected...
>
> config            kernel_build[1]             dd[2]                                               read-fault[3]
> ================================================================================================================
> 2.6.34-rc4        4:18.64, 4:56.06(+-0.190%)  0.276(+-1.298%), 0.532(+-0.808%), 2.659(+-0.869%)   3753.6(+-0.105%)
> MEMCG=n
>
> 2.6.34-rc4        4:19.60, 4:58.29(+-0.184%)  0.288(+-0.663%), 0.599(+-1.857%), 2.841(+-1.020%)   4172.3(+-0.074%)
> MEMCG=y
> root cgroup
>
> 2.6.34-rc4        5:02.41, 4:58.56(+-0.116%)  0.288(+-0.978%), 0.571(+-1.005%), 2.898(+-1.052%)   4162.8(+-0.317%)
> MEMCG=y
> non-root cgroup
>
> 2.6.34-rc4        4:21.02, 4:57.27(+-0.152%)  0.289(+-0.809%), 0.574(+-1.013%), 2.856(+-0.909%)   4159.0(+-0.280%)
> MEMCG=y
> mmotm
> root cgroup
>
> 2.6.34-rc4        5:01.13, 4:56.84(+-0.074%)  0.299(+-1.512%), 0.577(+-1.158%), 2.864(+-1.012%)   4202.3(+-0.149%)
> MEMCG=y
> mmotm
> non-root cgroup
>
> 2.6.34-rc4        4:19.44, 4:57.30(+-0.151%)  0.293(+-0.885%), 0.578(+-0.967%), 2.878(+-1.026%)   4219.1(+-0.007%)
> MEMCG=y
> mmotm
> irqsave locking
> root cgroup
>
> 2.6.34-rc4        5:01.07, 4:58.62(+-0.796%)  0.305(+-1.752%), 0.579(+-1.035%), 2.893(+-1.111%)   4254.3(+-0.095%)
> MEMCG=y
> mmotm
> irqsave locking
> non-root cgroup
>
> 2.6.34-rc4        4:19.53, 4:58.74(+-0.840%)  0.291(+-0.394%), 0.577(+-1.219%), 2.868(+-1.033%)   4004.4(+-0.059%)
> MEMCG=y
> mmotm
> RCU locking
> root cgroup
>
> 2.6.34-rc4        5:00.99, 4:57.04(+-0.069%)  0.289(+-1.027%), 0.575(+-1.069%), 2.858(+-1.102%)   4004.0(+-0.096%)
> MEMCG=y
> mmotm
> RCU locking
> non-root cgroup
>
> [1] kernel build is listed as two numbers: the first build (cache cold),
>     and the average of three non-first builds (warm cache). src and
>     output are in 2G tmpfs.
>
> [2] dd creates 10x files in tmpfs of various sizes (100M, 200M, 1000M) using:
>     "dd if=/dev/zero bs=$((1<<20)) count=..."
>
> [3] the micro-benchmark measures cycles (rdtsc) per read fault of an mmap-ed
>     file warm in the page cache (see the sketch after these notes).
>
> [4] MEMCG= is an abbreviation for CONFIG_CGROUP_MEM_RES_CTLR=
>
> [5] mmotm is dated 2010-04-15-14-42
>
> [6] irqsave locking converts all [un]lock_page_cgroup() to use
>     local_irq_save/restore()
>     (local commit a7f01d96417b10058a2128751fe4062e8a3ecc53). This was
>     previously proposed on linux-kernel and linux-mm.
>
> [7] the RCU locking patch is shown below
>     (local commit 231a4fec6ccdef9e630e184c0e0527c884eac57d).
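[Illustrative aside, not part of the thread: note [3] above measures cycles (rdtsc) per read fault of an mmap-ed, page-cache-warm file. A minimal user-space sketch of that kind of measurement might look like the following; the 4096-byte page size, the command-line handling, and the lack of error checking are assumptions of the sketch, not details of Greg's actual benchmark.]

/* Rough sketch of a cycles-per-read-fault measurement as described in
 * note [3] above; purely illustrative, not the benchmark used for the
 * numbers in the table. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(int argc, char **argv)
{
	struct stat st;
	volatile char *p;
	uint64_t start, total = 0;
	long i, pages;
	int fd;

	fd = open(argv[1], O_RDONLY);	/* file is assumed warm in the page cache */
	fstat(fd, &st);
	pages = st.st_size / 4096;

	p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

	for (i = 0; i < pages; i++) {
		start = rdtsc();
		(void)p[i * 4096];	/* touch one byte per page -> one minor read fault */
		total += rdtsc() - start;
	}

	printf("%llu cycles/fault\n", (unsigned long long)(total / pages));
	close(fd);
	return 0;
}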
>
> For reference, here's the RCU locking patch for 2010-04-15-14-42 mmotm,
> which patches 2.6.34-rc4.
>
> Use RCU to avoid lock_page_cgroup() in most situations.
>
> When locking, disable irq to allow for accounting from irq handlers.
>
I think the direction itself is good. But FILE_MAPPED is outside the control
of the radix-tree, so we need lock_page_cgroup()/unlock_page_cgroup()
unconditionally there, to avoid races with charge/uncharge.

And you don't need to call "begin/end reassignment" at page migration.
(I'll post a patch to modify the code around page migration; it has a BUG now.)

But you do have to call begin/end reassignment at force_empty, because it
moves accounting.

Thanks,
-Kame
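[Illustrative aside, not part of the thread: per the FILE_MAPPED comment above, that statistic would keep taking the page_cgroup lock unconditionally instead of using the RCU fast path. The sketch below mirrors the pre-patch mem_cgroup_update_file_mapped() visible in the diff context that follows; an irq-safe variant would additionally wrap the lock in local_irq_save()/local_irq_restore().]

/*
 * Sketch only: FILE_MAPPED updated under an unconditional
 * lock_page_cgroup(), so the update cannot race with charge/uncharge.
 */
void mem_cgroup_update_file_mapped(struct page *page, int val)
{
	struct mem_cgroup *mem;
	struct page_cgroup *pc;

	pc = lookup_page_cgroup(page);
	if (unlikely(!pc))
		return;

	lock_page_cgroup(pc);	/* always taken; no RCU fast path for this stat */
	mem = pc->mem_cgroup;
	if (mem && PageCgroupUsed(pc)) {
		/* preemption is disabled while the bit lock is held */
		if (val > 0) {
			__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
			SetPageCgroupFileMapped(pc);
		} else {
			__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
			ClearPageCgroupFileMapped(pc);
		}
	}
	unlock_page_cgroup(pc);
}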
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ee3b52f..cd46474 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -280,6 +280,12 @@ static bool move_file(void)
>  }
>
>  /*
> + * If accounting changes are underway, then access to the mem_cgroup field
> + * within struct page_cgroup requires locking.
> + */
> +static bool mem_cgroup_account_move_ongoing;
> +
> +/*
>   * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
>   * limit reclaim to prevent infinite loops, if they ever occur.
>   */
> @@ -1436,12 +1442,25 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
> +	bool locked = false;
> +	unsigned long flags = 0;
>
>  	pc = lookup_page_cgroup(page);
>  	if (unlikely(!pc))
>  		return;
>
> -	lock_page_cgroup(pc);
> +	/*
> +	 * Unless a page's cgroup reassignment is possible, then avoid grabbing
> +	 * the lock used to protect the cgroup assignment.
> +	 */
> +	rcu_read_lock();
> +	smp_rmb();
> +	if (unlikely(mem_cgroup_account_move_ongoing)) {
> +		local_irq_save(flags);
> +		lock_page_cgroup(pc);
> +		locked = true;
> +	}
> +
>  	mem = pc->mem_cgroup;
>  	if (!mem || !PageCgroupUsed(pc))
>  		goto done;
> @@ -1449,6 +1468,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>  	/*
>  	 * Preemption is already disabled. We can use __this_cpu_xxx
>  	 */
> +	VM_BUG_ON(preemptible());
>  	if (val > 0) {
>  		__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>  		SetPageCgroupFileMapped(pc);
> @@ -1458,7 +1478,11 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>  	}
>
>  done:
> -	unlock_page_cgroup(pc);
> +	if (unlikely(locked)) {
> +		unlock_page_cgroup(pc);
> +		local_irq_restore(flags);
> +	}
> +	rcu_read_unlock();
>  }
>
>  /*
> @@ -2498,6 +2522,28 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
>  #endif
>
>  /*
> + * Reassignment of mem_cgroup is possible, so locking is required. Make sure
> + * that locks are used when accessing mem_cgroup.
> + * mem_cgroup_end_page_cgroup_reassignment() balances this function.
> + */
> +static void mem_cgroup_begin_page_cgroup_reassignment(void)
> +{
> +	VM_BUG_ON(mem_cgroup_account_move_ongoing);
> +	mem_cgroup_account_move_ongoing = true;
> +	synchronize_rcu();
> +}
> +
> +/*
> + * Once page cgroup membership changes complete, this routine indicates that
> + * access to mem_cgroup does not require locks.
> + */
> +static void mem_cgroup_end_page_cgroup_reassignment(void)
> +{
> +	VM_BUG_ON(!mem_cgroup_account_move_ongoing);
> +	mem_cgroup_account_move_ongoing = false;
> +}
> +
> +/*
>   * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
>   * page belongs to.
>   */
> @@ -2524,6 +2570,10 @@ int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
>  		css_put(&mem->css);
>  	}
>  	*ptr = mem;
> +
> +	if (!ret)
> +		mem_cgroup_begin_page_cgroup_reassignment();
> +
>  	return ret;
>  }
>
> @@ -2536,7 +2586,8 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  	enum charge_type ctype;
>
>  	if (!mem)
> -		return;
> +		goto unlock;
> +
>  	cgroup_exclude_rmdir(&mem->css);
>  	/* at migration success, oldpage->mapping is NULL. */
>  	if (oldpage->mapping) {
> @@ -2583,6 +2634,9 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  	 * In that case, we need to call pre_destroy() again. check it here.
>  	 */
>  	cgroup_release_and_wakeup_rmdir(&mem->css);
> +
> +unlock:
> +	mem_cgroup_end_page_cgroup_reassignment();
>  }
>
>  /*
> @@ -4406,6 +4460,8 @@ static void mem_cgroup_clear_mc(void)
>  	mc.to = NULL;
>  	mc.moving_task = NULL;
>  	wake_up_all(&mc.waitq);
> +
> +	mem_cgroup_end_page_cgroup_reassignment();
>  }
>
>  static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> @@ -4440,6 +4496,8 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
>  	mc.moved_swap = 0;
>  	mc.moving_task = current;
>
> +	mem_cgroup_begin_page_cgroup_reassignment();
> +
>  	ret = mem_cgroup_precharge_mc(mm);
>  	if (ret)
>  		mem_cgroup_clear_mc();
>
> --
> Greg
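[Illustrative summary, not part of the thread: stripped of the memcg details, the protocol the patch implements is a rarely-true flag paired with synchronize_rcu(). A writer about to move pages between cgroups sets the flag and waits for pre-existing RCU readers; readers take the page_cgroup lock (with irqs disabled) only while the flag is set. The helper names below are made up for the sketch.]

/*
 * Distilled form of the reader/writer protocol in the patch above
 * (sketch only; names are hypothetical).
 */
static bool account_move_ongoing;

static void begin_account_move(void)
{
	account_move_ongoing = true;
	/* wait for readers that started before the flag became visible */
	synchronize_rcu();
}

static void end_account_move(void)
{
	account_move_ongoing = false;
}

static void update_one_stat(struct page_cgroup *pc)
{
	bool locked = false;
	unsigned long flags = 0;

	rcu_read_lock();
	if (unlikely(account_move_ongoing)) {
		local_irq_save(flags);
		lock_page_cgroup(pc);
		locked = true;
	}

	/* ... read pc->mem_cgroup and adjust a per-memcg counter here ... */

	if (locked) {
		unlock_page_cgroup(pc);
		local_irq_restore(flags);
	}
	rcu_read_unlock();
}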