On Tue, Jun 04, 2013 at 03:03:36PM +0200, Michal Hocko wrote: > On Mon 03-06-13 17:44:37, Tejun Heo wrote: > [...] > > @@ -1218,9 +1218,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > * is alive. > > */ > > dead_count = atomic_read(&root->dead_count); > > - smp_rmb(); > > + > > last_visited = iter->last_visited; > > if (last_visited) { > > + /* > > + * Paired with smp_wmb() below in this > > + * function. The pair guarantee that > > + * last_visited is more current than > > + * last_dead_count, which may lead to > > + * spurious iteration resets but guarantees > > + * reliable detection of dead condition. > > + */ > > + smp_rmb(); > > if ((dead_count != iter->last_dead_count) || > > !css_tryget(&last_visited->css)) { > > last_visited = NULL; > > I originally had the barrier this way but Johannes pointed out it is not > correct https://lkml.org/lkml/2013/2/11/411 > " > !> + /* > !> + * last_visited might be invalid if some of the group > !> + * downwards was removed. As we do not know which one > !> + * disappeared we have to start all over again from the > !> + * root. > !> + * css ref count then makes sure that css won't > !> + * disappear while we iterate to the next memcg > !> + */ > !> + last_visited = iter->last_visited; > !> + dead_count = atomic_read(&root->dead_count); > !> + smp_rmb(); > ! > !Confused about this barrier, see below. > ! > !As per above, if you remove the iter lock, those lines are mixed up. > !You need to read the dead count first because the writer updates the > !dead count after it sets the new position. That way, if the dead > !count gives the go-ahead, you KNOW that the position cache is valid, > !because it has been updated first. If either the two reads or the two > !writes get reordered, you risk seeing a matching dead count while the > !position cache is stale. > " The original prototype code I sent looked like this: mem_cgroup_iter: rcu_read_lock() if atomic_read(&root->dead_count) == iter->dead_count: smp_rmb() if tryget(iter->position): position = iter->position memcg = find_next(postion) css_put(position) iter->position = memcg smp_wmb() /* Write position cache BEFORE marking it uptodate */ iter->dead_count = atomic_read(&root->dead_count) rcu_read_unlock() iter->last_position is written, THEN iter->last_dead_count is written. So, yes, you "need to read the dead count" first to be sure iter->last_position is uptodate. But iter->last_dead_count, not root->dead_count. I should have caught this in the final submission of your patch :( Tejun's patch is not correct, either. Something like this? diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 010d6c1..92830fa 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1199,7 +1199,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, mz = mem_cgroup_zoneinfo(root, nid, zid); iter = &mz->reclaim_iter[reclaim->priority]; - last_visited = iter->last_visited; if (prev && reclaim->generation != iter->generation) { iter->last_visited = NULL; goto out_unlock; @@ -1217,14 +1216,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, * css_tryget() verifies the cgroup pointed to * is alive. */ + last_visited = NULL; dead_count = atomic_read(&root->dead_count); - smp_rmb(); - last_visited = iter->last_visited; - if (last_visited) { - if ((dead_count != iter->last_dead_count) || - !css_tryget(&last_visited->css)) { + if (dead_count == iter->last_dead_count) { + /* + * The writer below sets the position + * pointer, then the dead count. + * Ensure we read the updated position + * when the dead count matches. + */ + smp_rmb(); + last_visited = iter->last_visited; + if (last_visited && + !css_tryget(&last_visited->css)) last_visited = NULL; - } } } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>