Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup

Konstantin Khlebnikov <khlebnikov@xxxxxxxxxx> · Tue, 21 Feb 2012 10:05:12 +0400

Hugh Dickins wrote:
page_relock_lruvec() is using lookup_page_cgroup(page)->mem_cgroup
to find the memcg, and hence its per-zone lruvec for the page.  We
therefore need to be careful to see the right pc->mem_cgroup: where
is it updated?

In __mem_cgroup_commit_charge(), under lruvec lock whenever lru
care might be needed, lrucare holding the page off lru at that time.

In mem_cgroup_reset_owner(), not under lruvec lock, but before the
page can be visible to others - except compaction or lumpy reclaim,
which ignore the page because it is not yet PageLRU.

In mem_cgroup_split_huge_fixup(), always under lruvec lock.

In mem_cgroup_move_account(), which holds several locks, but an
lruvec lock not among them: yet it still appears to be safe, because
the page has been taken off its old lru and not yet put on the new.

Be particularly careful in compaction's isolate_migratepages() and
vmscan's lumpy handling in isolate_lru_pages(): those approach the
page by its physical location, and so can encounter pages which
would not be found by any logical lookup.  For those cases we have
to change __isolate_lru_page() slightly: it must leave ClearPageLRU
to the caller, because compaction and lumpy cannot safely interfere
with a page until they have first isolated it and then locked lruvec.

Yeah, this is most complicated part. I found one race here, see below.


To the list above we have to add __mem_cgroup_uncharge_common(),
and new function mem_cgroup_reset_uncharged_to_root(): the first
resetting pc->mem_cgroup to root_mem_cgroup when a page off lru is
uncharged, and the second when an uncharged page is taken off lru
(which used to be achieved implicitly with the PageAcctLRU flag).

That's because there's a remote risk that compaction or lumpy reclaim
will spy a page while it has PageLRU set; then it's taken off LRU and
freed, its mem_cgroup torn down and freed, the page reallocated (so
get_page_unless_zero again succeeds); then compaction or lumpy reclaim
reach their page_relock_lruvec, using the stale mem_cgroup for locking.

So long as there's one charge on the mem_cgroup, or a page on one of
its lrus, mem_cgroup_force_empty() cannot succeed and the mem_cgroup
cannot be destroyed.  But when an uncharged page is taken off lru,
or a page off lru is uncharged, it no longer protects its old memcg,
and the one stable root_mem_cgroup must then be used for it.

Signed-off-by: Hugh Dickins<hughd@xxxxxxxxxx>
---
  include/linux/memcontrol.h |    5 ++
  mm/compaction.c            |   36 ++++++-----------
  mm/memcontrol.c            |   45 +++++++++++++++++++--
  mm/swap.c                  |    2
  mm/vmscan.c                |   73 +++++++++++++++++++++++++----------
  5 files changed, 114 insertions(+), 47 deletions(-)

--- mmotm.orig/include/linux/memcontrol.h       2012-02-18 11:57:42.675524592 -0800
+++ mmotm/include/linux/memcontrol.h    2012-02-18 11:57:49.103524745 -0800
@@ -65,6 +65,7 @@ extern int mem_cgroup_cache_charge(struc
  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
  extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
  extern void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
+extern void mem_cgroup_reset_uncharged_to_root(struct page *);

  /* For coalescing uncharge for reducing memcg' overhead*/
  extern void mem_cgroup_uncharge_start(void);
@@ -251,6 +252,10 @@ static inline void mem_cgroup_update_lru
  {
  }

+static inline void mem_cgroup_reset_uncharged_to_root(struct page *page)
+{
+}
+
  static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
  {
         return NULL;
--- mmotm.orig/mm/compaction.c  2012-02-18 11:57:42.675524592 -0800
+++ mmotm/mm/compaction.c       2012-02-18 11:57:49.103524745 -0800
@@ -356,28 +356,6 @@ static isolate_migrate_t isolate_migrate
                         continue;
                 }

-               if (!lruvec) {
-                       /*
-                        * We do need to take the lock before advancing to
-                        * check PageLRU etc., but there's no guarantee that
-                        * the page we're peeking at has a stable memcg here.
-                        */
-                       lruvec =&zone->lruvec;
-                       lock_lruvec(lruvec);
-               }
-               if (!PageLRU(page))
-                       continue;
-
-               /*
-                * PageLRU is set, and lru_lock excludes isolation,
-                * splitting and collapsing (collapsing has already
-                * happened if PageLRU is set).
-                */
-               if (PageTransHuge(page)) {
-                       low_pfn += (1<<  compound_order(page)) - 1;
-                       continue;
-               }
-
                 if (!cc->sync)
                         mode |= ISOLATE_ASYNC_MIGRATE;

@@ -386,10 +364,24 @@ static isolate_migrate_t isolate_migrate
                         continue;

                 page_relock_lruvec(page,&lruvec);

Here race with mem_cgroup_move_account() we hold lock for old lruvec,
while move_account() recharge page and put page back into other lruvec.
Thus we see PageLRU(), but below we isolate page from wrong lruvec.

In my patch-set this is fixed with __wait_lru_unlock() [ spin_unlock_wait() ]
in mem_cgroup_move_account()

+               if (unlikely(!PageLRU(page) || PageUnevictable(page) ||
+                                               PageTransHuge(page))) {
+                       /*
+                        * lru_lock excludes splitting a huge page,
+                        * but we cannot hold lru_lock while freeing page.
+                        */
+                       low_pfn += (1<<  compound_order(page)) - 1;
+                       unlock_lruvec(lruvec);
+                       lruvec = NULL;
+                       put_page(page);
+                       continue;
+               }

                 VM_BUG_ON(PageTransCompound(page));

                 /* Successfully isolated */
+               ClearPageLRU(page);
+               mem_cgroup_reset_uncharged_to_root(page);
                 del_page_from_lru_list(page, lruvec, page_lru(page));
                 list_add(&page->lru, migratelist);
                 cc->nr_migratepages++;
--- mmotm.orig/mm/memcontrol.c  2012-02-18 11:57:42.679524592 -0800
+++ mmotm/mm/memcontrol.c       2012-02-18 11:57:49.107524745 -0800
@@ -1069,6 +1069,33 @@ void page_relock_lruvec(struct page *pag
         *lruvp = lruvec;
  }

+void mem_cgroup_reset_uncharged_to_root(struct page *page)
+{
+       struct page_cgroup *pc;
+
+       if (mem_cgroup_disabled())
+               return;
+
+       VM_BUG_ON(PageLRU(page));
+
+       /*
+        * Once an uncharged page is isolated from the mem_cgroup's lru,
+        * it no longer protects that mem_cgroup from rmdir: reset to root.
+        *
+        * __page_cache_release() and release_pages() may be called at
+        * interrupt time: we cannot lock_page_cgroup() then (we might
+        * have interrupted a section with page_cgroup already locked),
+        * nor do we need to since the page is frozen and about to be freed.
+        */
+       pc = lookup_page_cgroup(page);
+       if (page_count(page))
+               lock_page_cgroup(pc);
+       if (!PageCgroupUsed(pc)&&  pc->mem_cgroup != root_mem_cgroup)
+               pc->mem_cgroup = root_mem_cgroup;
+       if (page_count(page))
+               unlock_page_cgroup(pc);
+}
+
  /**
   * mem_cgroup_update_lru_size - account for adding or removing an lru page
   * @lruvec: mem_cgroup per zone lru vector
@@ -2865,6 +2892,7 @@ __mem_cgroup_uncharge_common(struct page
         struct mem_cgroup *memcg = NULL;
         unsigned int nr_pages = 1;
         struct page_cgroup *pc;
+       struct lruvec *lruvec;
         bool anon;

         if (mem_cgroup_disabled())
@@ -2884,6 +2912,7 @@ __mem_cgroup_uncharge_common(struct page
         if (unlikely(!PageCgroupUsed(pc)))
                 return NULL;

+       lruvec = page_lock_lruvec(page);
         lock_page_cgroup(pc);

         memcg = pc->mem_cgroup;
@@ -2915,14 +2944,17 @@ __mem_cgroup_uncharge_common(struct page
         mem_cgroup_charge_statistics(memcg, anon, -nr_pages);

         ClearPageCgroupUsed(pc);
+
         /*
-        * pc->mem_cgroup is not cleared here. It will be accessed when it's
-        * freed from LRU. This is safe because uncharged page is expected not
-        * to be reused (freed soon). Exception is SwapCache, it's handled by
-        * special functions.
+        * Once an uncharged page is isolated from the mem_cgroup's lru,
+        * it no longer protects that mem_cgroup from rmdir: reset to root.
          */
+       if (!PageLRU(page)&&  pc->mem_cgroup != root_mem_cgroup)
+               pc->mem_cgroup = root_mem_cgroup;

         unlock_page_cgroup(pc);
+       unlock_lruvec(lruvec);
+
         /*
          * even after unlock, we have memcg->res.usage here and this memcg
          * will never be freed.
@@ -2939,6 +2971,7 @@ __mem_cgroup_uncharge_common(struct page

  unlock_out:
         unlock_page_cgroup(pc);
+       unlock_lruvec(lruvec);
         return NULL;
  }

@@ -3327,7 +3360,9 @@ static struct page_cgroup *lookup_page_c
          * the first time, i.e. during boot or memory hotplug;
          * or when mem_cgroup_disabled().
          */
-       if (likely(pc)&&  PageCgroupUsed(pc))
+       if (!pc || PageCgroupUsed(pc))
+               return pc;
+       if (pc->mem_cgroup&&  pc->mem_cgroup != root_mem_cgroup)
                 return pc;
         return NULL;
  }
--- mmotm.orig/mm/swap.c        2012-02-18 11:57:42.679524592 -0800
+++ mmotm/mm/swap.c     2012-02-18 11:57:49.107524745 -0800
@@ -52,6 +52,7 @@ static void __page_cache_release(struct
                 lruvec = page_lock_lruvec(page);
                 VM_BUG_ON(!PageLRU(page));
                 __ClearPageLRU(page);
+               mem_cgroup_reset_uncharged_to_root(page);
                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
                 unlock_lruvec(lruvec);
         }
@@ -583,6 +584,7 @@ void release_pages(struct page **pages,
                         page_relock_lruvec(page,&lruvec);
                         VM_BUG_ON(!PageLRU(page));
                         __ClearPageLRU(page);
+                       mem_cgroup_reset_uncharged_to_root(page);
                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
                 }

--- mmotm.orig/mm/vmscan.c      2012-02-18 11:57:42.679524592 -0800
+++ mmotm/mm/vmscan.c   2012-02-18 11:57:49.107524745 -0800
@@ -1087,11 +1087,11 @@ int __isolate_lru_page(struct page *page

         if (likely(get_page_unless_zero(page))) {
                 /*
-                * Be careful not to clear PageLRU until after we're
-                * sure the page is not being freed elsewhere -- the
-                * page release code relies on it.
+                * Beware of interface change: now leave ClearPageLRU(page)
+                * to the caller, because memcg's lumpy and compaction
+                * cases (approaching the page by its physical location)
+                * may not have the right lru_lock yet.
                  */
-               ClearPageLRU(page);
                 ret = 0;
         }

@@ -1154,7 +1154,16 @@ static unsigned long isolate_lru_pages(u

                 switch (__isolate_lru_page(page, mode, file)) {
                 case 0:
+#ifdef CONFIG_DEBUG_VM
+                       /* check lock on page is lock we already got */
+                       page_relock_lruvec(page,&lruvec);
+                       BUG_ON(lruvec != home_lruvec);
+                       BUG_ON(page != lru_to_page(src));
+                       BUG_ON(page_lru(page) != lru);
+#endif
+                       ClearPageLRU(page);
                         isolated_pages = hpage_nr_pages(page);
+                       mem_cgroup_reset_uncharged_to_root(page);
                         mem_cgroup_update_lru_size(lruvec, lru, -isolated_pages);
                         list_move(&page->lru, dst);
                         nr_taken += isolated_pages;
@@ -1211,21 +1220,7 @@ static unsigned long isolate_lru_pages(u
                             !PageSwapCache(cursor_page))
                                 break;

-                       if (__isolate_lru_page(cursor_page, mode, file) == 0) {
-                               mem_cgroup_page_relock_lruvec(cursor_page,
-&lruvec);
-                               isolated_pages = hpage_nr_pages(cursor_page);
-                               mem_cgroup_update_lru_size(lruvec,
-                                       page_lru(cursor_page), -isolated_pages);
-                               list_move(&cursor_page->lru, dst);
-
-                               nr_taken += isolated_pages;
-                               nr_lumpy_taken += isolated_pages;
-                               if (PageDirty(cursor_page))
-                                       nr_lumpy_dirty += isolated_pages;
-                               scan++;
-                               pfn += isolated_pages - 1;
-                       } else {
+                       if (__isolate_lru_page(cursor_page, mode, file) != 0) {
                                 /*
                                  * Check if the page is freed already.
                                  *
@@ -1243,13 +1238,50 @@ static unsigned long isolate_lru_pages(u
                                         continue;
                                 break;
                         }
+
+                       /*
+                        * This locking call is a no-op in the non-memcg
+                        * case, since we already hold the right lru_lock;
+                        * but it may change the lock in the memcg case.
+                        * It is then vital to recheck PageLRU (but not
+                        * necessary to recheck isolation mode).
+                        */
+                       mem_cgroup_page_relock_lruvec(cursor_page,&lruvec);
+
+                       if (PageLRU(cursor_page)&&
+                           !PageUnevictable(cursor_page)) {
+                               ClearPageLRU(cursor_page);
+                               isolated_pages = hpage_nr_pages(cursor_page);
+                               mem_cgroup_reset_uncharged_to_root(cursor_page);
+                               mem_cgroup_update_lru_size(lruvec,
+                                       page_lru(cursor_page), -isolated_pages);
+                               list_move(&cursor_page->lru, dst);
+
+                               nr_taken += isolated_pages;
+                               nr_lumpy_taken += isolated_pages;
+                               if (PageDirty(cursor_page))
+                                       nr_lumpy_dirty += isolated_pages;
+                               scan++;
+                               pfn += isolated_pages - 1;
+                       } else {
+                               /* Cannot hold lru_lock while freeing page */
+                               unlock_lruvec(lruvec);
+                               lruvec = NULL;
+                               put_page(cursor_page);
+                               break;
+                       }
                 }

                 /* If we break out of the loop above, lumpy reclaim failed */
                 if (pfn<  end_pfn)
                         nr_lumpy_failed++;

-               lruvec = home_lruvec;
+               if (lruvec != home_lruvec) {
+                       if (lruvec)
+                               unlock_lruvec(lruvec);
+                       lruvec = home_lruvec;
+                       lock_lruvec(lruvec);
+               }
         }

         *nr_scanned = scan;
@@ -1301,6 +1333,7 @@ int isolate_lru_page(struct page *page)
                         int lru = page_lru(page);
                         get_page(page);
                         ClearPageLRU(page);
+                       mem_cgroup_reset_uncharged_to_root(page);
                         del_page_from_lru_list(page, lruvec, lru);
                         ret = 0;
                 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>