On Wed, Dec 12, 2012 at 11:24 AM, Michal Hocko <mhocko@xxxxxxx> wrote: > On Wed 12-12-12 10:06:52, Michal Hocko wrote: >> On Tue 11-12-12 14:36:10, Ying Han wrote: > [...] >> > One exception is mem_cgroup_iter_break(), where the loop terminates >> > with *leaked* refcnt and that is what the iter_break() needs to clean >> > up. We can not rely on the next caller of the loop since it might >> > never happen. >> >> Yes, this is true and I already have a half baked patch for that. I >> haven't posted it yet but it basically checks all node-zone-prio >> last_visited and removes itself from them on the way out in pre_destroy >> callback (I just need to cleanup "find a new last_visited" part and will >> post it). > > And a half baked patch - just compile tested > --- > From 1c976c079c383175c679e00115aee0ab8e215bf2 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@xxxxxxx> > Date: Tue, 11 Dec 2012 21:02:39 +0100 > Subject: [PATCH] NOT READY YET - just compile tested > > memcg: remove memcg from the reclaim iterators > > Now that per-node-zone-priority iterator caches memory cgroups rather > than their css ids we have to be careful and remove them from the > iterator when they are on the way out otherwise they might hang for > unbounded amount of time (until the global reclaim triggers the zone > under priority to find out the group is dead and let it to find the > final rest). > > This is solved by hooking into mem_cgroup_pre_destroy and checking all > per-node-zone-priority iterators. If the current memcg is found in > iter->last_visited then it is replaced by its left sibling or its parent > otherwise. This guarantees that no group gets more reclaiming than > necessary and the next iteration will continue seemingly. > > Spotted-by: Ying Han <yinghan@xxxxxxxxxx> > Not-signed-off-by-yet: Michal Hocko <mhocko@xxxxxxx> > --- > mm/memcontrol.c | 38 ++++++++++++++++++++++++++++++++++++++ > 1 file changed, 38 insertions(+) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 7134148..286db74 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6213,12 +6213,50 @@ free_out: > return ERR_PTR(error); > } > > +static void mem_cgroup_remove_cached(struct mem_cgroup *memcg) > +{ > + int node, zone; > + > + for_each_node(node) { > + struct mem_cgroup_per_node *pn = memcg->info.nodeinfo[node]; > + int prio; > + > + for (zone = 0; zone < MAX_NR_ZONES; zone++) { > + struct mem_cgroup_per_zone *mz; > + > + mz = &pn->zoneinfo[zone]; > + for (prio = 0; prio < DEF_PRIORITY + 1; prio++) { > + struct mem_cgroup_reclaim_iter *iter; > + > + iter = &mz->reclaim_iter[prio]; > + rcu_read_lock(); > + spin_lock(&iter->iter_lock); > + if (iter->last_visited == memcg) { > + struct cgroup *cgroup, *prev; > + > + cgroup = memcg->css.cgroup; > + prev = list_entry_rcu(cgroup->sibling.prev, struct cgroup, sibling); > + if (&prev->sibling == &prev->parent->children) > + prev = prev->parent; > + iter->last_visited = mem_cgroup_from_cont(prev); > + > + /* TODO can we do this? */ > + css_put(&memcg->css); > + } > + spin_unlock(&iter->iter_lock); > + rcu_read_unlock(); > + } > + } > + } > +} > + > static void mem_cgroup_pre_destroy(struct cgroup *cont) > { > struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); > > mem_cgroup_reparent_charges(memcg); > mem_cgroup_destroy_all_caches(memcg); > + mem_cgroup_remove_cached(memcg); > } > > static void mem_cgroup_destroy(struct cgroup *cont) > -- > 1.7.10.4 > > -- > Michal Hocko > SUSE Labs I haven't tried this patch set yet. Before I am doing that, I am curious whether changing the target reclaim to be consistent with global reclaim something worthy to consider based my last reply: diff --git a/mm/vmscan.c b/mm/vmscan.c index 53dcde9..3f158c5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1911,20 +1911,6 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) shrink_lruvec(lruvec, sc); - /* - * Limit reclaim has historically picked one memcg and - * scanned it with decreasing priority levels until - * nr_to_reclaim had been reclaimed. This priority - * cycle is thus over after a single memcg. - * - * Direct reclaim and kswapd, on the other hand, have - * to scan all memory cgroups to fulfill the overall - * scan target for the zone. - */ - if (!global_reclaim(sc)) { - mem_cgroup_iter_break(root, memcg); - break; - } memcg = mem_cgroup_iter(root, memcg, &reclaim); } while (memcg); } --Ying -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>