On Tue 09-08-11 18:32:16, KAMEZAWA Hiroyuki wrote: > On Tue, 9 Aug 2011 11:31:50 +0200 > Michal Hocko <mhocko@xxxxxxx> wrote: > > > What do you think about the half backed patch bellow? I didn't manage to > > test it yet but I guess it should help. I hate asymmetry of drain_lock > > locking (it is acquired somewhere else than it is released which is > > not). I will think about a nicer way how to do it. > > Maybe I should also split the rcu part in a separate patch. > > > > What do you think? > > > I'd like to revert 8521fc50 first and consider total design change > rather than ad-hoc fix. Agreed. Revert should go into 3.0 stable as well. Although the global mutex is buggy we have that behavior for a long time without any reports. We should address it but it can wait for 3.2. > Personally, I don't like to have spin-lock in per-cpu area. spinlock is not that different from what we already have with the bit lock. > > > Thanks, > -Kame > > > --- > > From 26c2cdc55aa14ec4a54e9c8e2c8b9072c7cb8e28 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@xxxxxxx> > > Date: Tue, 9 Aug 2011 10:53:28 +0200 > > Subject: [PATCH] memcg: fix drain_all_stock crash > > > > 8521fc50 (memcg: get rid of percpu_charge_mutex lock) introduced a crash > > in sync mode when we are about to check whether we have to wait for the > > work because we are calling mem_cgroup_same_or_subtree without checking > > FLUSHING_CACHED_CHARGE before so we can dereference already cleaned > > cache (the simplest case would be when we drain the local cache). > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 > > IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70 > > PGD 4ae7a067 PUD 4adc4067 PMD 0 > > Oops: 0000 [#1] PREEMPT SMP > > CPU 0 > > Pid: 19677, comm: rmdir Tainted: G W 3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3 > > RIP: 0010:[<ffffffff81083b70>] [<ffffffff81083b70>] css_is_ancestor+0x20/0x70 > > RSP: 0018:ffff880077b09c88 EFLAGS: 00010202 > > RAX: ffff8800781bb310 RBX: 0000000000000000 RCX: 000000000000003e > > RDX: 0000000000000000 RSI: ffff8800779f7c00 RDI: 0000000000000000 > > RBP: ffff880077b09c98 R08: ffffffff818a4e88 R09: 0000000000000000 > > R10: 0000000000000000 R11: dead000000100100 R12: ffff8800779f7c00 > > R13: ffff8800779f7c00 R14: 0000000000000000 R15: ffff88007bc0eb80 > > FS: 00007f5d689ec720(0000) GS:ffff88007bc00000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > CR2: 0000000000000018 CR3: 000000004ad57000 CR4: 00000000000006f0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310) > > Stack: > > ffffffff818a4e88 000000000000eb80 ffff880077b09ca8 ffffffff810feba3 > > ffff880077b09d08 ffffffff810feccf ffff880077b09cf8 0000000000000001 > > ffff88007bd0eb80 0000000000000001 ffff880077af2000 0000000000000000 > > Call Trace: > > [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40 > > [<ffffffff810feccf>] drain_all_stock+0x11f/0x170 > > [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0 > > [<ffffffff81111872>] ? path_put+0x22/0x30 > > [<ffffffff8111c925>] ? __d_lookup+0xb5/0x170 > > [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20 > > [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500 > > [<ffffffff81063990>] ? abort_exclusive_wait+0xb0/0xb0 > > [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0 > > [<ffffffff811233d3>] ? mnt_want_write+0x43/0x80 > > [<ffffffff81114e7b>] do_rmdir+0xfb/0x110 > > [<ffffffff81114ea6>] sys_rmdir+0x16/0x20 > > [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b > > > > Testing FLUSHING_CACHED_CHARGE before dereferencing is still not enough > > because then we still might see mem == NULL so we have to check it > > before dereferencing. > > We have to do all stock checking under FLUSHING_CACHED_CHARGE bit lock > > so it is much easier to use a spin_lock instead. Let's also add a flag > > (under_drain) that draining in progress so that concurrent callers do > > not have to wait on the lock pointlessly. > > > > Finally we do not make sure that the mem still exists. It could have > > been removed in the meantime: > > CPU0 CPU1 CPU2 > > mem=stock->cached > > stock->cached=NULL > > clear_bit > > test_and_set_bit > > test_bit() ... > > <preempted> mem_cgroup_destroy > > use after free > > > > `...' is actually quite a bunch of work to do so the race is not very > > probable. The important thing, though, is that cgroup_subsys->destroy > > (mem_cgroup_destroy) is called after synchronize_rcu so we can protect > > by calling rcu_read_lock when dereferencing cached mem. > > > > TODO: > > - check if under_drain needs some memory barriers > > - check the hotplug path (can we wait on spinlock?) > > - better changelog > > - do some testing > > > > Signed-off-by: Michal Hocko <mhocko@xxxxxxx> > > Reported-by: Johannes Weiner <jweiner@xxxxxxxxxx> > > --- > > mm/memcontrol.c | 67 +++++++++++++++++++++++++++++++++++++++++------------- > > 1 files changed, 51 insertions(+), 16 deletions(-) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index f4ec4e7..e34f9fd 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -2087,8 +2087,8 @@ struct memcg_stock_pcp { > > struct mem_cgroup *cached; /* this never be root cgroup */ > > unsigned int nr_pages; > > struct work_struct work; > > - unsigned long flags; > > -#define FLUSHING_CACHED_CHARGE (0) > > + spinlock_t drain_lock; /* protects from parallel draining */ > > + bool under_drain; > > }; > > static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock); > > > > @@ -2114,6 +2114,7 @@ static bool consume_stock(struct mem_cgroup *mem) > > > > /* > > * Returns stocks cached in percpu to res_counter and reset cached information. > > + * Do not call this directly - use drain_local_stock instead. > > */ > > static void drain_stock(struct memcg_stock_pcp *stock) > > { > > @@ -2133,12 +2134,16 @@ static void drain_stock(struct memcg_stock_pcp *stock) > > /* > > * This must be called under preempt disabled or must be called by > > * a thread which is pinned to local cpu. > > + * Parameter is not used. > > + * Assumes stock->drain_lock held. > > */ > > static void drain_local_stock(struct work_struct *dummy) > > { > > struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock); > > drain_stock(stock); > > - clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); > > + > > + stock->under_drain = false; > > + spin_unlock(&stock->drain_lock); > > } > > > > /* > > @@ -2150,7 +2155,9 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages) > > struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock); > > > > if (stock->cached != mem) { /* reset if necessary */ > > - drain_stock(stock); > > + spin_lock(&stock->drain_lock); > > + stock->under_drain = true; > > + drain_local_stock(NULL); > > stock->cached = mem; > > } > > stock->nr_pages += nr_pages; > > @@ -2179,17 +2186,27 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync) > > struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); > > struct mem_cgroup *mem; > > > > + /* > > + * make sure we are not waiting when somebody already drains > > + * the cache. > > + */ > > + if (!spin_trylock(&stock->drain_lock)) { > > + if (stock->under_drain) > > + continue; > > + spin_lock(&stock->drain_lock); > > + } > > mem = stock->cached; > > - if (!mem || !stock->nr_pages) > > + if (!mem || !stock->nr_pages || > > + !mem_cgroup_same_or_subtree(root_mem, mem)) { > > + spin_unlock(&stock->drain_lock); > > continue; > > - if (!mem_cgroup_same_or_subtree(root_mem, mem)) > > - continue; > > - if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) { > > - if (cpu == curcpu) > > - drain_local_stock(&stock->work); > > - else > > - schedule_work_on(cpu, &stock->work); > > } > > + > > + stock->under_drain = true; > > + if (cpu == curcpu) > > + drain_local_stock(&stock->work); > > + else > > + schedule_work_on(cpu, &stock->work); > > } > > > > if (!sync) > > @@ -2197,8 +2214,20 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync) > > > > for_each_online_cpu(cpu) { > > struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); > > - if (mem_cgroup_same_or_subtree(root_mem, stock->cached) && > > - test_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) > > + struct mem_cgroup *mem; > > + bool wait_for_drain = false; > > + > > + /* > > + * we have to be careful about parallel group destroying > > + * (mem_cgroup_destroy) which is derefered after sychronize_rcu > > + */ > > + rcu_read_lock(); > > + mem = stock->cached; > > + wait_for_drain = stock->under_drain && > > + mem && mem_cgroup_same_or_subtree(root_mem, mem); > > + rcu_read_unlock(); > > + > > + if (wait_for_drain) > > flush_work(&stock->work); > > } > > out: > > @@ -2278,8 +2307,12 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb, > > for_each_mem_cgroup_all(iter) > > mem_cgroup_drain_pcp_counter(iter, cpu); > > > > - stock = &per_cpu(memcg_stock, cpu); > > - drain_stock(stock); > > + if (!spin_trylock(&stock->drain_lock)) { > > + if (stock->under_drain) > > + return NOTIFY_OK; > > + spin_lock(&stock->drain_lock); > > + } > > + drain_local_stock(NULL); > > return NOTIFY_OK; > > } > > > > @@ -5068,6 +5101,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > > struct memcg_stock_pcp *stock = > > &per_cpu(memcg_stock, cpu); > > INIT_WORK(&stock->work, drain_local_stock); > > + stock->under_drain = false; > > + spin_lock_init(&stock->drain_lock); > > } > > hotcpu_notifier(memcg_cpu_hotplug_callback, 0); > > } else { > > -- > > 1.7.5.4 > > > > > > -- > > Michal Hocko > > SUSE Labs > > SUSE LINUX s.r.o. > > Lihovarska 1060/12 > > 190 00 Praha 9 > > Czech Republic > > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@xxxxxxxxx. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a> -- Michal Hocko SUSE Labs SUSE LINUX s.r.o. Lihovarska 1060/12 190 00 Praha 9 Czech Republic -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>