On Wed, Apr 14, 2010 at 11:54 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> On Thu, 15 Apr 2010 15:21:04 +0900
> Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx> wrote:
>> > The only reason to use trylock in this case is to prevent deadlock
>> > when running in a context that may have preempted or interrupted a
>> > routine that already holds the bit locked. In
>> > __remove_from_page_cache() irqs are disabled, but that does not imply
>> > that a routine holding the spinlock has been preempted. When the bit
>> > is locked, preemption is disabled. The only way to interrupt a holder
>> > of the bit is for an interrupt to occur (I/O, timer, etc). So I think
>> > that in_interrupt() is sufficient. Am I missing something?
>> >
>> IIUC, it would be enough to prevent deadlock where one CPU tries to acquire
>> the same page cgroup lock. But there is still some possibility where 2 CPUs
>> can deadlock each other (please see commit e767e056).
>> IOW, my point is "don't call lock_page_cgroup() under mapping->tree_lock".
>>
> Hmm, maybe worth a try. We may be able to set/clear the DIRTY/WRITEBACK bits
> on page_cgroup without mapping->tree_lock.
> In such a case, of course, the page itself should be locked by lock_page().
>
> But, hmm... for example:
>
> account_page_dirtied() is the best place to mark page_cgroup dirty. But
> it's called under mapping->tree_lock.
>
> Another thought:
> I wonder if we may have to change our approach to dirty page accounting.
>
> Please see task_dirty_inc(). It's for per-task dirty limiting.
> And you'll notice soon that there is no task_dirty_dec().

Hello Kame-san,

This is an interesting idea. If this applies to memcg dirty accounting,
then would it also apply to system-wide dirty accounting? I don't think
so, but I wanted to float the idea. The proportions.c code looks good at
comparing the rates of events (for example: per-task dirty page events).
However, for system-wide dirty accounting we also want to consider the
amount of dirty memory, not just the rate at which it is being dirtied.

Cgroup dirty page accounting imposes the following additional accounting
complexities:
* hierarchical accounting
* page migration between cgroups

For per-memcg dirty accounting, are you thinking that each mem_cgroup
would have a struct prop_local_single to represent a memcg's dirty
memory usage relative to a system-wide prop_descriptor?

My concern is that we will still need an efficient way to determine the
mem_cgroup associated with a page under a variety of conditions (the
page fault path for new mappings, softirq context for dirty page
writeback). Currently -rc4 and -mmotm use a non-irq-safe
lock_page_cgroup() to protect a page's cgroup membership. I think this
will cause problems as we add more per-cgroup stats (dirty page counts,
etc.) that are adjusted in irq handlers. Proposed approaches include:

1. Use try-style locking. This can lead to fuzzy counters, which some do
   not like. Over time these fuzzy counters may drift.
2. Mask irqs when calling lock_page_cgroup(). This has some performance
   cost, though it may be small (see below); a rough sketch of this
   approach follows the list.
3. Because a page's cgroup membership rarely changes, use RCU locking.
   This is fast, but may be more complex than we want.
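To make option 2 a bit more concrete, here is a minimal sketch of the kind
of irq-safe locking I have in mind. The wrapper names below are
hypothetical; the tree measured as "irqsave locking" (note [6] below)
simply wraps local_irq_save/restore() around the existing
lock_page_cgroup()/unlock_page_cgroup() bit-spinlock helpers from
include/linux/page_cgroup.h:

/*
 * Sketch only: hypothetical irq-safe wrappers around the existing
 * page_cgroup bit spinlock.  Disabling irqs while the bit is held lets
 * the same lock be taken from irq context (e.g. end of page writeback)
 * without deadlocking against a holder on the same CPU.
 */
static inline void lock_page_cgroup_irqsave(struct page_cgroup *pc,
					    unsigned long *flags)
{
	local_irq_save(*flags);
	lock_page_cgroup(pc);	/* bit_spin_lock(PCG_LOCK, &pc->flags) */
}

static inline void unlock_page_cgroup_irqrestore(struct page_cgroup *pc,
						 unsigned long flags)
{
	unlock_page_cgroup(pc);	/* bit_spin_unlock(PCG_LOCK, &pc->flags) */
	local_irq_restore(flags);
}

Accounting done from irq handlers would then be safe against a lock
holder on the same CPU, at the cost of the small irq-disable overhead
measured below.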
The performance of simple irqsave locking or of the more involved RCU
locking is similar to the current (non-irqsave/non-RCU) locking for
several workloads (kernel build, dd). Using a micro-benchmark some
differences are seen:
* irqsave locking is 1% slower than mmotm non-irqsave/non-RCU locking.
* RCU locking is 4% faster than mmotm non-irqsave/non-RCU locking.
* RCU locking is 5% faster than irqsave locking.

I think we need some locking changes to allow per-memcg dirty page
accounting updates from irq handlers. If we want to focus on
micro-benchmark performance, then RCU locking seems like the correct
approach. Otherwise, irqsave locking seems adequate. I'm thinking that
for now we should start simple and use irqsave. Comments?

Here's the data I collected...

config            kernel_build[1]
                  dd[2]
                  read-fault[3]
===================================================================
2.6.34-rc4        4:18.64, 4:56.06(+-0.190%)
MEMCG=n           0.276(+-1.298%), 0.532(+-0.808%), 2.659(+-0.869%)
                  3753.6(+-0.105%)

2.6.34-rc4        4:19.60, 4:58.29(+-0.184%)
MEMCG=y           0.288(+-0.663%), 0.599(+-1.857%), 2.841(+-1.020%)
root cgroup       4172.3(+-0.074%)

2.6.34-rc4        5:02.41, 4:58.56(+-0.116%)
MEMCG=y           0.288(+-0.978%), 0.571(+-1.005%), 2.898(+-1.052%)
non-root cgroup   4162.8(+-0.317%)

2.6.34-rc4        4:21.02, 4:57.27(+-0.152%)
MEMCG=y           0.289(+-0.809%), 0.574(+-1.013%), 2.856(+-0.909%)
mmotm             4159.0(+-0.280%)
root cgroup

2.6.34-rc4        5:01.13, 4:56.84(+-0.074%)
MEMCG=y           0.299(+-1.512%), 0.577(+-1.158%), 2.864(+-1.012%)
mmotm             4202.3(+-0.149%)
non-root cgroup

2.6.34-rc4        4:19.44, 4:57.30(+-0.151%)
MEMCG=y           0.293(+-0.885%), 0.578(+-0.967%), 2.878(+-1.026%)
mmotm             4219.1(+-0.007%)
irqsave locking
root cgroup

2.6.34-rc4        5:01.07, 4:58.62(+-0.796%)
MEMCG=y           0.305(+-1.752%), 0.579(+-1.035%), 2.893(+-1.111%)
mmotm             4254.3(+-0.095%)
irqsave locking
non-root cgroup

2.6.34-rc4        4:19.53, 4:58.74(+-0.840%)
MEMCG=y           0.291(+-0.394%), 0.577(+-1.219%), 2.868(+-1.033%)
mmotm             4004.4(+-0.059%)
RCU locking
root cgroup

2.6.34-rc4        5:00.99, 4:57.04(+-0.069%)
MEMCG=y           0.289(+-1.027%), 0.575(+-1.069%), 2.858(+-1.102%)
mmotm             4004.0(+-0.096%)
RCU locking
non-root cgroup

[1] kernel build is listed as two numbers: the first (cache cold) build,
    and the average of three subsequent builds (warm cache). Source and
    output are in a 2G tmpfs.
[2] dd creates 10x files in tmpfs of various sizes (100M, 200M, 1000M)
    using: "dd if=/dev/zero bs=$((1<<20)) count=..."
[3] the micro-benchmark measures cycles (rdtsc) per read fault of an
    mmap-ed file that is warm in the page cache. A rough sketch of the
    measurement loop follows these notes.
[4] MEMCG= is an abbreviation for CONFIG_CGROUP_MEM_RES_CTLR=
[5] mmotm is dated 2010-04-15-14-42
[6] irqsave locking converts all [un]lock_page_cgroup() calls to use
    local_irq_save/restore() (local commit
    a7f01d96417b10058a2128751fe4062e8a3ecc53). This was previously
    proposed on linux-kernel and linux-mm.
[7] the RCU locking patch is shown below (local commit
    231a4fec6ccdef9e630e184c0e0527c884eac57d).
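For reference, here is a minimal userspace sketch of the read-fault
measurement described in [3]. It only illustrates the approach, not the
exact harness I used; the file path is a placeholder, and reporting a raw
average that includes loop overhead is an assumption (good enough for
comparing kernels against each other).

/*
 * Rough sketch of the read-fault micro-benchmark in [3] (x86 only):
 * mmap a file that is already warm in the page cache, touch the first
 * byte of each page once, and report average cycles per read fault.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile";  /* placeholder */
	long page_size = sysconf(_SC_PAGESIZE);
	volatile char sink;
	struct stat st;
	uint64_t start;
	char *map;
	long pages, i;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(path);
		return 1;
	}
	pages = st.st_size / page_size;
	if (pages == 0) {
		fprintf(stderr, "%s: file too small\n", path);
		return 1;
	}

	map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	start = rdtsc();
	for (i = 0; i < pages; i++)
		sink = map[i * page_size];	/* one read fault per page */
	printf("%.1f cycles/fault over %ld faults\n",
	       (double)(rdtsc() - start) / pages, pages);

	munmap(map, st.st_size);
	close(fd);
	return 0;
}

The per-fault cost of the different locking schemes shows up directly in
this number (the read-fault[3] column above).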
For reference, here's the RCU locking patch for 2010-04-15-14-42 mmotm,
which applies on top of 2.6.34-rc4. It uses RCU to avoid
lock_page_cgroup() in most situations; when the lock is taken, irqs are
disabled to allow accounting from irq handlers.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ee3b52f..cd46474 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -280,6 +280,12 @@ static bool move_file(void)
 }
 
 /*
+ * If accounting changes are underway, then access to the mem_cgroup field
+ * within struct page_cgroup requires locking.
+ */
+static bool mem_cgroup_account_move_ongoing;
+
+/*
  * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
  * limit reclaim to prevent infinite loops, if they ever occur.
  */
@@ -1436,12 +1442,25 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
+	bool locked = false;
+	unsigned long flags = 0;
 
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
 
-	lock_page_cgroup(pc);
+	/*
+	 * Unless a page's cgroup reassignment is possible, avoid grabbing
+	 * the lock used to protect the cgroup assignment.
+	 */
+	rcu_read_lock();
+	smp_rmb();
+	if (unlikely(mem_cgroup_account_move_ongoing)) {
+		local_irq_save(flags);
+		lock_page_cgroup(pc);
+		locked = true;
+	}
+
 	mem = pc->mem_cgroup;
 	if (!mem || !PageCgroupUsed(pc))
 		goto done;
@@ -1449,6 +1468,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 	/*
 	 * Preemption is already disabled. We can use __this_cpu_xxx
 	 */
+	VM_BUG_ON(preemptible());
 	if (val > 0) {
 		__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		SetPageCgroupFileMapped(pc);
@@ -1458,7 +1478,11 @@
 	}
 
 done:
-	unlock_page_cgroup(pc);
+	if (unlikely(locked)) {
+		unlock_page_cgroup(pc);
+		local_irq_restore(flags);
+	}
+	rcu_read_unlock();
 }
 
 /*
@@ -2498,6 +2522,28 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 #endif
 
 /*
+ * Reassignment of mem_cgroup is possible, so locking is required. Make sure
+ * that locks are used when accessing mem_cgroup.
+ * mem_cgroup_end_page_cgroup_reassignment() balances this function.
+ */
+static void mem_cgroup_begin_page_cgroup_reassignment(void)
+{
+	VM_BUG_ON(mem_cgroup_account_move_ongoing);
+	mem_cgroup_account_move_ongoing = true;
+	synchronize_rcu();
+}
+
+/*
+ * Once page cgroup membership changes complete, this routine indicates that
+ * access to mem_cgroup does not require locks.
+ */
+static void mem_cgroup_end_page_cgroup_reassignment(void)
+{
+	VM_BUG_ON(!mem_cgroup_account_move_ongoing);
+	mem_cgroup_account_move_ongoing = false;
+}
+
+/*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
  */
@@ -2524,6 +2570,10 @@ int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
 		css_put(&mem->css);
 	}
 	*ptr = mem;
+
+	if (!ret)
+		mem_cgroup_begin_page_cgroup_reassignment();
+
 	return ret;
 }
 
@@ -2536,7 +2586,8 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
 	enum charge_type ctype;
 
 	if (!mem)
-		return;
+		goto unlock;
+
 	cgroup_exclude_rmdir(&mem->css);
 	/* at migration success, oldpage->mapping is NULL. */
 	if (oldpage->mapping) {
@@ -2583,6 +2634,9 @@
 	 * In that case, we need to call pre_destroy() again. check it here.
 	 */
 	cgroup_release_and_wakeup_rmdir(&mem->css);
+
+unlock:
+	mem_cgroup_end_page_cgroup_reassignment();
 }
 
 /*
@@ -4406,6 +4460,8 @@ static void mem_cgroup_clear_mc(void)
 	mc.to = NULL;
 	mc.moving_task = NULL;
 	wake_up_all(&mc.waitq);
+
+	mem_cgroup_end_page_cgroup_reassignment();
 }
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
@@ -4440,6 +4496,8 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 		mc.moved_swap = 0;
 		mc.moving_task = current;
 
+		mem_cgroup_begin_page_cgroup_reassignment();
+
 		ret = mem_cgroup_precharge_mc(mm);
 		if (ret)
 			mem_cgroup_clear_mc();

--
Greg