(2013/02/14 22:26), Michal Hocko wrote: > Now that per-node-zone-priority iterator caches memory cgroups rather > than their css ids we have to be careful and remove them from the > iterator when they are on the way out otherwise they might live for > unbounded amount of time even though their group is already gone (until > the global/targeted reclaim triggers the zone under priority to find out > the group is dead and let it to find the final rest). > > We can fix this issue by relaxing rules for the last_visited memcg. > Instead of taking a reference to the css before it is stored into > iter->last_visited we can just store its pointer and track the number of > removed groups from each memcg's subhierarchy. > > This number would be stored into iterator everytime when a memcg is > cached. If the iter count doesn't match the curent walker root's one we > will start from the root again. The group counter is incremented upwards > the hierarchy every time a group is removed. > > The iter_lock can be dropped because racing iterators cannot leak > the reference anymore as the reference count is not elevated for > last_visited when it is cached. > > Locking rules got a bit complicated by this change though. The iterator > primarily relies on rcu read lock which makes sure that once we see > a valid last_visited pointer then it will be valid for the whole RCU > walk. smp_rmb makes sure that dead_count is read before last_visited > and last_dead_count while smp_wmb makes sure that last_visited is > updated before last_dead_count so the up-to-date last_dead_count cannot > point to an outdated last_visited. css_tryget then makes sure that > the last_visited is still alive in case the iteration races with the > cached group removal (css is invalidated before mem_cgroup_css_offline > increments dead_count). > > In short: > mem_cgroup_iter > rcu_read_lock() > dead_count = atomic_read(parent->dead_count) > smp_rmb() > if (dead_count != iter->last_dead_count) > last_visited POSSIBLY INVALID -> last_visited = NULL > if (!css_tryget(iter->last_visited)) > last_visited DEAD -> last_visited = NULL > next = find_next(last_visited) > css_tryget(next) > css_put(last_visited) // css would be invalidated and parent->dead_count > // incremented if this was the last reference > iter->last_visited = next > smp_wmb() > iter->last_dead_count = dead_count > rcu_read_unlock() > > cgroup_rmdir > cgroup_destroy_locked > atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail > mem_cgroup_css_offline > mem_cgroup_invalidate_reclaim_iterators > while(parent = parent_mem_cgroup) > atomic_inc(parent->dead_count) > css_put(css) // last reference held by cgroup core > > Spotted-by: Ying Han <yinghan@xxxxxxxxxx> > Original-idea-by: Johannes Weiner <hannes@xxxxxxxxxxx> > Signed-off-by: Michal Hocko <mhocko@xxxxxxx> interesting. Thank you for your hard works. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>