* KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> [2010-05-07 15:26:06]: > > It seems Mel's fix will go ahead, soon. (it works well) So, we can do > test migration by compaction, again. Let's start to fix memcg. > This is a patch under a test in my box (-mm + Mel's fix). > > == > At page migration, the new page is unlocked before calling end_migration(). > This is mis-understanding with page-migration code of memcg. > (it worked but...) > This has small race and we need to fix this race. By this, > FILE_MAPPED of migrated file cache is not properly updated, now. > > This patch is for fixing the race by changing algorithm. > > At migrating mapped file, events happens in following sequence. > > 1. allocate a new page. > 2. get memcg of an old page. > 3. charge ageinst a new page before migration. But at this point, > no changes to new page's page_cgroup, no commit-charge. > 4. page migration replaces radix-tree, old-page and new-page. > 5. page migration remaps the new page if the old page was mapped. > 6. Here, the new page is unlocked. > 7. memcg commits the charge for newpage, Mark page cgroup as USED. > > Because "commit" happens after page-remap, we can count FILE_MAPPED > at "5", we can lose file_mapped accounting information within this > small race. FILE_MAPPED is updated only when mapcount changes 0->1. > So, if we don't catch this 0->1 event, we can underflow FILE_MAPPED > at 1->0 event. > > We may be able to avoid underflow by some small technique but > we should catch mapcount 0->1 event. To catch this, we have to > make page_cgroup of new page as "USED". > > BTW, historically, above implemntation comes from migration-failure > of anonymous page. Because we charge both of old page and new page > with mapcount=0, we can't catch > - the page is really freed before remap. > - migration fails but it's freed before remap > or .....corner cases. > > For fixing all, this changes parepare/end migration. > > New migration sequence with memcg is: > > 1. allocate a new page. > 2. mark PageCgroupMigration to the old page. > 3. charge against a new page onto the old page's memcg. (here, new page's pc > is marked as PageCgroupUsed.) > 4. mark PageCgroupMigration to the old page. > > If page_cgroup is marked as PageCgroupMigration, it's uncahrged until > it's cleared. > > 5. page migration replaces radix-tree, page table, etc... > 6. At remapping, new page's page_cgroup is now makrked as "USED" > We can catch 0->1 event and FILE_MAPPED will be properly updated. > And we can catch SWAPOUT event after unlock this and freeing this > page by unmap() can be caught. > > 7. Clear PageCgroupMigration of the old page. > > By this: > At migration success of Anon: > - The new page is properly charged. If not-mapped after remap, > uncharge() will be called. > - The file cache is properly charged. FILE_MAPPED event can be caught. > > At migration failure of Anon: > - The old page stays as charged. If not mapped after remap, > uncharge() will called. The corner case is SWAPOUT. But, while migraion, > it's locked. So, we have no race with it. > > Then, for what MIGRATION flag is ? > Without it, at migration failure, we may have to charge old page again > because it may be fully unmapped. "charge" means that we have to dive into > memory reclaim or something complated. So, it's better to avoid > charge it again. Before this patch, __commit_charge() was working for > both of the old/new page and fixed up all. But this technique has some > racy condtion around FILE_MAPPED and SWAPOUT etc... > Now, the kernel use MIGRATION flag and don't uncharge old page until > the end of migration. > > > I hope this change will make memcg's page migration much simpler. > This page migration has caused several troubles. Worth to add > a flag for simplification. > Yep it is, the explanation and race is getting very complex. I am glad you are looking at it so closely. I wonder if we can simplify some of this, may be deserves a closer look later. > Changelog: 2010/04/20 > - fixed SWAPOUT case. > - added texts for explanation. > - removed MIGRAION flag onto new page. > > Changelog: 2010/04/15 > - updated against mm-of-the-moment snapshot 2010-04-15-14-42 > + Nishimura's fix. memcg-fix-prepare-migration-fix.patch > - fixed some typos. > - handle migration failure of anon page. > > Changelog: 2010/04/14 > - updated onto the latest mmotm + page-compaction, etc... > - fixed __try_charge() bypass case. > - use CHARGE_TYPE_FORCE for uncharging an unused page. > > Reported-by: Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> > --- > include/linux/memcontrol.h | 6 + > include/linux/page_cgroup.h | 5 + > mm/memcontrol.c | 145 +++++++++++++++++++++++++++++++------------- > mm/migrate.c | 2 > 4 files changed, 115 insertions(+), 43 deletions(-) > > Index: linux-2.6.34-rc4-mm1/mm/memcontrol.c > =================================================================== > --- linux-2.6.34-rc4-mm1.orig/mm/memcontrol.c > +++ linux-2.6.34-rc4-mm1/mm/memcontrol.c > @@ -2278,7 +2278,8 @@ __mem_cgroup_uncharge_common(struct page > switch (ctype) { > case MEM_CGROUP_CHARGE_TYPE_MAPPED: > case MEM_CGROUP_CHARGE_TYPE_DROP: > - if (page_mapped(page)) > + /* See mem_cgroup_prepare_migration() */ > + if (page_mapped(page) || PageCgroupMigration(pc)) > goto unlock_out; > break; > case MEM_CGROUP_CHARGE_TYPE_SWAPOUT: > @@ -2501,10 +2502,12 @@ static inline int mem_cgroup_move_swap_a > * Before starting migration, account PAGE_SIZE to mem_cgroup that the old > * page belongs to. > */ > -int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr) > +int mem_cgroup_prepare_migration(struct page *page, > + struct page *newpage, struct mem_cgroup **ptr) > { > struct page_cgroup *pc; > struct mem_cgroup *mem = NULL; > + enum charge_type ctype; > int ret = 0; > > if (mem_cgroup_disabled()) > @@ -2515,69 +2518,131 @@ int mem_cgroup_prepare_migration(struct > if (PageCgroupUsed(pc)) { > mem = pc->mem_cgroup; > css_get(&mem->css); > + /* > + * At migrationg an anonymous page, its mapcount goes down > + * to 0 and uncharge() will be called. But, even if it's fully > + * unmapped, migration may fail and this page has to be > + * charged again. We set MIGRATION flag here and delay uncharge > + * until end_migration() is called > + * > + * Corner Case Thinking > + * A) > + * When the old page was mapped as Anon and it's unmap-and-freed > + * while migration was ongoing, it's not uncharged until > + * end_migration(). Both of old and new page will be uncharged > + * at end_migration() because it's not mapped and not SwapCache. > + * > + * B) > + * When the old page was mapped but migraion fails, the kernel > + * remap it. The charge for it is kept by MIGRATION flag even > + * if mapcount goes down to 0, we can do remap successfully > + * without charging it again. > + * If the kernel doesn't remap it because it's unmapped, > + * we can check it at end_migration(), no new charge happens > + * at end_migration(). > + * > + * C) > + * The "old" page is under lock_page() until the end of > + * migration, so, the old page itself will not be swapped-out. > + * But the new page can be. > + * etc... > + */ > + if (PageAnon(page)) > + SetPageCgroupMigration(pc); > } > unlock_page_cgroup(pc); > + /* > + * If the page is not charged at this point, > + * we return here. > + */ > + if (!mem) > + return 0; > > *ptr = mem; > - if (mem) { > - ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false); > - css_put(&mem->css); > + ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false); > + css_put(&mem->css);/* drop extra refcnt */ > + if (ret || *ptr == NULL) { > + if (PageAnon(page)) { > + lock_page_cgroup(pc); > + ClearPageCgroupMigration(pc); > + unlock_page_cgroup(pc); > + /* > + * The old page may be fully unmapped while we kept it. > + */ > + mem_cgroup_uncharge_page(page); > + } > + return -ENOMEM; > } > + /* > + * We charge new page before it's mapped. So, even if unlock_page() > + * is called far before end_migration, we can catch all events on > + * this new page. In the case new page is migrated but not remapped, > + * new page's mapcount will be finally 0 and we call uncharge in > + * end_migration(). > + */ > + pc = lookup_page_cgroup(newpage); > + if (PageAnon(page)) > + ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED; > + else if (page_is_file_cache(page)) > + ctype = MEM_CGROUP_CHARGE_TYPE_CACHE; > + else > + ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM; > + __mem_cgroup_commit_charge(mem, pc, ctype); > return ret; > } > > /* remove redundant charge if migration failed*/ > void mem_cgroup_end_migration(struct mem_cgroup *mem, > - struct page *oldpage, struct page *newpage) > + struct page *oldpage, struct page *newpage) > { > - struct page *target, *unused; > + struct page *used, *unused; > struct page_cgroup *pc; > - enum charge_type ctype; > > if (!mem) > return; > + /* blocks rmdir() */ > cgroup_exclude_rmdir(&mem->css); > /* at migration success, oldpage->mapping is NULL. */ > if (oldpage->mapping) { > - target = oldpage; > - unused = NULL; > + used = oldpage; > + unused = newpage; > } else { > - target = newpage; > + used = newpage; > unused = oldpage; > } > - > - if (PageAnon(target)) > - ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED; > - else if (page_is_file_cache(target)) > - ctype = MEM_CGROUP_CHARGE_TYPE_CACHE; > - else > - ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM; > - > - /* unused page is not on radix-tree now. */ > - if (unused) > - __mem_cgroup_uncharge_common(unused, ctype); > - > - pc = lookup_page_cgroup(target); > /* > - * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup. > - * So, double-counting is effectively avoided. > + * We disallowed uncharge of pages under migration because mapcount > + * of the page goes down to zero, temporarly. > + * Clear the flag and check the page should be charged. > */ > - __mem_cgroup_commit_charge(mem, pc, ctype); > - > + pc = lookup_page_cgroup(unused); > + /* This flag itself is not racy, so, check it before lock */ > + if (PageCgroupMigration(pc)) { > + lock_page_cgroup(pc); > + ClearPageCgroupMigration(pc); > + unlock_page_cgroup(pc); > + } > + __mem_cgroup_uncharge_common(unused, MEM_CGROUP_CHARGE_TYPE_FORCE); > + > + pc = lookup_page_cgroup(used); > + if (PageCgroupMigration(pc)) { > + lock_page_cgroup(pc); > + ClearPageCgroupMigration(pc); > + unlock_page_cgroup(pc); > + } Could you please explain the double check for PageCgroupMigration? > /* > - * Both of oldpage and newpage are still under lock_page(). > - * Then, we don't have to care about race in radix-tree. > - * But we have to be careful that this page is unmapped or not. > - * > - * There is a case for !page_mapped(). At the start of > - * migration, oldpage was mapped. But now, it's zapped. > - * But we know *target* page is not freed/reused under us. > - * mem_cgroup_uncharge_page() does all necessary checks. > - */ > - if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) > - mem_cgroup_uncharge_page(target); > + * If a page is a file cache, radix-tree replacement is very atomic > + * and we can skip this check. When it was an Anon page, its mapcount > + * goes down to 0. But because we added MIGRATION flage, it's not > + * uncharged yet. There are several case but page->mapcount check > + * and USED bit check in mem_cgroup_uncharge_page() will do enough > + * check. (see prepare_charge() also) > + */ > + if (PageAnon(used)) > + mem_cgroup_uncharge_page(used); > /* > - * At migration, we may charge account against cgroup which has no tasks > + * At migration, we may charge account against cgroup which has no > + * tasks. > * So, rmdir()->pre_destroy() can be called while we do this charge. > * In that case, we need to call pre_destroy() again. check it here. > */ > Index: linux-2.6.34-rc4-mm1/mm/migrate.c > =================================================================== > --- linux-2.6.34-rc4-mm1.orig/mm/migrate.c > +++ linux-2.6.34-rc4-mm1/mm/migrate.c > @@ -581,7 +581,7 @@ static int unmap_and_move(new_page_t get > } > > /* charge against new page */ > - charge = mem_cgroup_prepare_migration(page, &mem); > + charge = mem_cgroup_prepare_migration(page, newpage, &mem); > if (charge == -ENOMEM) { > rc = -ENOMEM; > goto unlock; > Index: linux-2.6.34-rc4-mm1/include/linux/memcontrol.h > =================================================================== > --- linux-2.6.34-rc4-mm1.orig/include/linux/memcontrol.h > +++ linux-2.6.34-rc4-mm1/include/linux/memcontrol.h > @@ -89,7 +89,8 @@ int mm_match_cgroup(const struct mm_stru > extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem); > > extern int > -mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr); > +mem_cgroup_prepare_migration(struct page *page, > + struct page *newpage, struct mem_cgroup **ptr); > extern void mem_cgroup_end_migration(struct mem_cgroup *mem, > struct page *oldpage, struct page *newpage); > > @@ -228,7 +229,8 @@ static inline struct cgroup_subsys_state > } > > static inline int > -mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr) > +mem_cgroup_prepare_migration(struct page *page, struct page *newpage, > + struct mem_cgroup **ptr) > { > return 0; > } > Index: linux-2.6.34-rc4-mm1/include/linux/page_cgroup.h > =================================================================== > --- linux-2.6.34-rc4-mm1.orig/include/linux/page_cgroup.h > +++ linux-2.6.34-rc4-mm1/include/linux/page_cgroup.h > @@ -40,6 +40,7 @@ enum { > PCG_USED, /* this object is in use. */ > PCG_ACCT_LRU, /* page has been accounted for */ > PCG_FILE_MAPPED, /* page is accounted as "mapped" */ > + PCG_MIGRATION, /* under page migration */ > }; > > #define TESTPCGFLAG(uname, lname) \ > @@ -79,6 +80,10 @@ SETPCGFLAG(FileMapped, FILE_MAPPED) > CLEARPCGFLAG(FileMapped, FILE_MAPPED) > TESTPCGFLAG(FileMapped, FILE_MAPPED) > > +SETPCGFLAG(Migration, MIGRATION) > +CLEARPCGFLAG(Migration, MIGRATION) > +TESTPCGFLAG(Migration, MIGRATION) > + > static inline int page_cgroup_nid(struct page_cgroup *pc) > { > return page_to_nid(pc->page); > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@xxxxxxxxxx For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a> -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>