On Sat, Dec 12, 2015 at 04:34:02PM +0300, Vladimir Davydov wrote: > Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break() > once enough pages have been reclaimed, in which case, in contrast to a > full round-trip over a cgroup sub-tree, the current position stored in > mem_cgroup_reclaim_iter of the target cgroup does not get invalidated > and so is left holding the reference to the last scanned cgroup. If the > target cgroup does not get scanned again (we might have just reclaimed > the last page or all processes might exit and free their memory > voluntary), we will leak it, because there is nobody to put the > reference held by the iterator. > > The problem is easy to reproduce by running the following command > sequence in a loop: > > mkdir /sys/fs/cgroup/memory/test > echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes > echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs > memhog 150M > echo $$ > /sys/fs/cgroup/memory/cgroup.procs > rmdir test > > The cgroups generated by it will never get freed. > > This patch fixes this issue by making mem_cgroup_iter_break() clear > mem_cgroup_reclaim_iter->position in case it points to the memory cgroup > we interrupted reclaim on. > > Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting") > Signed-off-by: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx> > Cc: Johannes Weiner <hannes@xxxxxxxxxxx> > Cc: Michal Hocko <mhocko@xxxxxxxxxx> > Cc: <stable@xxxxxxxxxxxxxxx> # 3.19+ Good catch! The downside of not remembering the last position across incomplete reclaim cycles is that we always restart at the same position. If a cgroup has a certain number of children, it's conceivable that we might never get to the youngest cgroups in the subtree anymore. So I'd be more comfortable removing incomplete reclaim walks. It was a nice little optimization we could do while supporting interleave walks, but it's not necessary: when global reclaim can walk the entire system, limit reclaim should be okay walking subtrees exhaustively. How about this? diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 34f4a14..62a4731 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -933,6 +933,9 @@ out: * mem_cgroup_iter_break - abort a hierarchy walk prematurely * @root: hierarchy root * @prev: last visited hierarchy member as returned by mem_cgroup_iter() + * + * Note: do not use this from a shared iterator walk (when using a + * reclaim cookie), the iterator state may leak the css reference! */ void mem_cgroup_iter_break(struct mem_cgroup *root, struct mem_cgroup *prev) diff --git a/mm/vmscan.c b/mm/vmscan.c index 2dbc679..41b07b2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2425,21 +2425,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, sc->nr_scanned - scanned, sc->nr_reclaimed - reclaimed); - /* - * Direct reclaim and kswapd have to scan all memory - * cgroups to fulfill the overall scan target for the - * zone. - * - * Limit reclaim, on the other hand, only cares about - * nr_to_reclaim pages to be reclaimed and it will - * retry with decreasing priority if one round over the - * whole hierarchy is not sufficient. - */ - if (!global_reclaim(sc) && - sc->nr_reclaimed >= sc->nr_to_reclaim) { - mem_cgroup_iter_break(root, memcg); - break; - } } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>