On Tue, Jan 08, 2013 at 08:17:59PM -0800, Hugh Dickins wrote:
> On Mon, 7 Jan 2013, Mel Gorman wrote:
>
> > Hugh Dickins pointed out that migrate_misplaced_transhuge_page() does not
> > check page_count before migrating like base page migration and khugepaged.
> > He could not see why this was safe and he is right.
> >
> > The potential impact of the bug is avoided due to the limitations of NUMA
> > balancing. The page_mapcount() check ensures that only a single address
> > space is using this page and as THPs are typically private it should not be
> > possible for another address space to fault it in parallel. If the address
> > space has one associated task then it's difficult to have both a GUP pin
> > and be referencing the page at the same time. If there are multiple tasks
> > then a buggy scenario requires that another thread be accessing the page
> > while the direct IO is in flight. This is dodgy behaviour as there is
> > a possibility of corruption with or without THP migration. It would be
> > difficult to identify the corruption as being a migration bug.
> >
> > While we happen to be safe for the most part it is shoddy to depend on
> > such "safety" so this patch checks the page count similar to anonymous
> > pages. Note that this does not mean that the page_mapcount() check can go
> > away. If we were to remove the page_mapcount() check, the THP would
> > have to be unmapped from all referencing PTEs, replaced with migration
> > PTEs and restored properly afterwards.
> >
> > Reported-by: Hugh Dickins <hughd@xxxxxxxxxx>
> > Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
>
> Sorry, Mel, it's a NAK:

Don't be sorry. It's not your fault I am a muppet.

> you will have expected an ack from me two weeks
> or more ago; but somehow I had an intuition that if I sat on it for
> long enough, a worm would crawl out.
> Got down to looking again today,
> and I notice that although the putback_lru_page() is right,
> NR_ISOLATED_ANON is not restored on this path, so that would leak.
>
> I expect you'll want to do something like:
> 		if (isolated) {
> 			putback_lru_page(page);
> 			isolated = 0;
> 			goto out;
> 		}
> and that may be the appropriate fix right now.
>

Yes, I sent what should be a correction so we can treat this cleanup as
a potential replacement for it.

> But I do still dislike the way you always put_page in
> numamigrate_isolate_page():

At one point there was a conditional put_page depending on different
failures and it was very difficult to follow. This was easier to follow
but could still be improved.

> it makes sense in the case when
> isolate_lru_page() succeeds (I've long thought that weird both to
> insist on an existing page reference and add one of its own), but
> I find it very confusing on the failure paths, to have the put_page
> far away from the unlock_page - and I get worried when I see put_page
> followed by unlock_page rather than vice versa (it happens on !pmd_same
> paths: if the pmd is not the same, then can we be sure that the put_page
> does not free the page?)
>

I'm depending on the page table reference to prevent the put_page()
freeing the page. As mmap_sem is held, there cannot be an munmap()
freeing it. As we hold the page lock there cannot be a parallel reclaim
as trylock_page() in shrink_page_list() will fail.

> At the bottom I've put my own cleanup for this, which simplifies by
> doing the putback_lru_page() inside numamigrate_isolate_page(), and
> doesn't put_page when it doesn't isolate.
>
> I think the only functional difference from yours (aside from fixing
> up NR_ISOLATED) is that migrate_misplaced_transhuge_page() doesn't
> have to pretend to its caller that it succeeded when actually it
> failed at the last hurdle (because it already did the unlock_page,
> which in yours the caller expects to do on failure).
> Oh, and I'm
> not holding page lock (sometimes) at clear_pmdnuma: I didn't see
> the reason for that, perhaps I'm missing something important there.
>
> Maybe our tastes differ, and you won't see mine as an improvement.
> And I've hardly tested, so haven't signed off, and won't be
> surprised if its own worms crawl out.
>
> <SNIP>
>
> Not-signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> ---
>
>  mm/huge_memory.c |   28 +++++----------
>  mm/migrate.c     |   79 ++++++++++++++++++++++-----------------------
>  2 files changed, 48 insertions(+), 59 deletions(-)
>
> --- 3.8-rc2/mm/huge_memory.c	2012-12-22 09:43:27.616015582 -0800
> +++ linux/mm/huge_memory.c	2013-01-08 17:39:06.340407864 -0800
> @@ -1298,7 +1298,6 @@ int do_huge_pmd_numa_page(struct mm_stru
>  	int target_nid;
>  	int current_nid = -1;
>  	bool migrated;
> -	bool page_locked = false;
>
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_same(pmd, *pmdp)))
> @@ -1320,7 +1319,6 @@ int do_huge_pmd_numa_page(struct mm_stru
>  	/* Acquire the page lock to serialise THP migrations */
>  	spin_unlock(&mm->page_table_lock);
>  	lock_page(page);
> -	page_locked = true;
>
>  	/* Confirm the PTE did not while locked */
>  	spin_lock(&mm->page_table_lock);
> @@ -1333,34 +1331,26 @@ int do_huge_pmd_numa_page(struct mm_stru
>
>  	/* Migrate the THP to the requested node */
>  	migrated = migrate_misplaced_transhuge_page(mm, vma,
> -				pmdp, pmd, addr,
> -				page, target_nid);
> -	if (migrated)
> -		current_nid = target_nid;
> -	else {
> -		spin_lock(&mm->page_table_lock);
> -		if (unlikely(!pmd_same(pmd, *pmdp))) {
> -			unlock_page(page);
> -			goto out_unlock;
> -		}
> -		goto clear_pmdnuma;
> -	}
> +				pmdp, pmd, addr, page, target_nid);
> +	if (!migrated)
> +		goto check_same;
>
> -	task_numa_fault(current_nid, HPAGE_PMD_NR, migrated);
> +	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
>  	return 0;
>

Ok.

> +check_same:
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(pmd, *pmdp)))
> +		goto out_unlock;

Ok.
> clear_pmdnuma:
>  	pmd = pmd_mknonnuma(pmd);
>  	set_pmd_at(mm, haddr, pmdp, pmd);
>  	VM_BUG_ON(pmd_numa(*pmdp));
>  	update_mmu_cache_pmd(vma, addr, pmdp);
> -	if (page_locked)
> -		unlock_page(page);
> -

This is ok too. The page lock is to prevent the page being reclaimed in
parallel during migration. In the two cases where you end up in
clear_pmdnuma, the page table lock is taken and a pmd_same check has
been done, so the page lock is not necessary.

> out_unlock:
>  	spin_unlock(&mm->page_table_lock);
>  	if (current_nid != -1)
> -		task_numa_fault(current_nid, HPAGE_PMD_NR, migrated);
> +		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
>  	return 0;
>  }
>
> --- 3.8-rc2/mm/migrate.c	2012-12-22 09:43:27.636015582 -0800
> +++ linux/mm/migrate.c	2013-01-08 18:17:02.664144777 -0800
> @@ -1555,39 +1555,38 @@ bool numamigrate_update_ratelimit(pg_dat
>
>  int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>  {
> -	int ret = 0;
> +	int page_lru;
>
>  	/* Avoid migrating to a node that is nearly full */
> -	if (migrate_balanced_pgdat(pgdat, 1)) {
> -		int page_lru;
> +	if (!migrate_balanced_pgdat(pgdat, 1))
> +		return 0;
>
> -		if (isolate_lru_page(page)) {
> -			put_page(page);
> -			return 0;
> -		}
> -
> -		/* Page is isolated */
> -		ret = 1;
> -		page_lru = page_is_file_cache(page);
> -		if (!PageTransHuge(page))
> -			inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
> -		else
> -			mod_zone_page_state(page_zone(page),
> -					NR_ISOLATED_ANON + page_lru,
> -					HPAGE_PMD_NR);
> +	if (isolate_lru_page(page))
> +		return 0;
> +
> +	/*
> +	 * migrate_misplaced_transhuge_page() skips page migration's usual
> +	 * check on page_count(), so we must do it here, now that the page
> +	 * has been isolated: a GUP pin, or any other pin, prevents migration.
> +	 * The expected page count is 3: 1 for page's mapcount and 1 for the
> +	 * caller's pin and 1 for the reference taken by isolate_lru_page.
> +	 */
> +	if (PageTransHuge(page) && page_count(page) != 3) {
> +		putback_lru_page(page);
> +		return 0;
>  	}
>

Ok, so you put the page back before the isolated count is updated.
Makes sense.

> +	page_lru = page_is_file_cache(page);
> +	mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
> +				hpage_nr_pages(page));
> +
>  	/*
> -	 * Page is either isolated or there is not enough space on the target
> -	 * node. If isolated, then it has taken a reference count and the
> -	 * callers reference can be safely dropped without the page
> -	 * disappearing underneath us during migration. Otherwise the page is
> -	 * not to be migrated but the callers reference should still be
> -	 * dropped so it does not leak.
> +	 * Isolating the page has taken another reference, so the
> +	 * caller's reference can be safely dropped without the page
> +	 * disappearing underneath us during migration.
>  	 */
>  	put_page(page);
> -
> -	return ret;
> +	return 1;
>  }
>
>  /*
> @@ -1598,7 +1597,7 @@ int numamigrate_isolate_page(pg_data_t *
>  int migrate_misplaced_page(struct page *page, int node)
>  {
>  	pg_data_t *pgdat = NODE_DATA(node);
> -	int isolated = 0;
> +	int isolated;
>  	int nr_remaining;
>  	LIST_HEAD(migratepages);
>
> @@ -1606,20 +1605,16 @@ int migrate_misplaced_page(struct page *
>  	 * Don't migrate pages that are mapped in multiple processes.
>  	 * TODO: Handle false sharing detection instead of this hammer
>  	 */
> -	if (page_mapcount(page) != 1) {
> -		put_page(page);
> +	if (page_mapcount(page) != 1)
>  		goto out;
> -	}
>
>  	/*
>  	 * Rate-limit the amount of data that is being migrated to a node.
>  	 * Optimal placement is no good if the memory bus is saturated and
>  	 * all the time is being spent migrating!
>  	 */
> -	if (numamigrate_update_ratelimit(pgdat, 1)) {
> -		put_page(page);
> +	if (numamigrate_update_ratelimit(pgdat, 1))
>  		goto out;
> -	}
>
>  	isolated = numamigrate_isolate_page(pgdat, page);
>  	if (!isolated)
> @@ -1636,8 +1631,11 @@ int migrate_misplaced_page(struct page *
>  	} else
>  		count_vm_numa_event(NUMA_PAGE_MIGRATE);
>  	BUG_ON(!list_empty(&migratepages));
> -out:
>  	return isolated;
> +
> +out:
> +	put_page(page);
> +	return 0;
>  }

Ok.

>  #endif /* CONFIG_NUMA_BALANCING */
>
> @@ -1672,17 +1670,15 @@ int migrate_misplaced_transhuge_page(str
>
>  	new_page = alloc_pages_node(node,
>  		(GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
> -	if (!new_page) {
> -		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
> -		goto out_dropref;
> -	}
> +	if (!new_page)
> +		goto out_fail;
> +
>  	page_xchg_last_nid(new_page, page_last_nid(page));
>
>  	isolated = numamigrate_isolate_page(pgdat, page);
>  	if (!isolated) {
> -		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
>  		put_page(new_page);
> -		goto out_keep_locked;
> +		goto out_fail;
>  	}
>
>  	/* Prepare a page as a migration target */
> @@ -1714,6 +1710,7 @@ int migrate_misplaced_transhuge_page(str
>  		putback_lru_page(page);
>
>  		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
> +		isolated = 0;
>  		goto out;
>  	}
>
> @@ -1758,9 +1755,11 @@ out:
>  			-HPAGE_PMD_NR);
>  	return isolated;
>
> +out_fail:
> +	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
>  out_dropref:
> +	unlock_page(page);
>  	put_page(page);
> -out_keep_locked:
>  	return 0;
>  }
>  #endif /* CONFIG_NUMA_BALANCING */

I haven't tested this but I cannot see a problem with it either and the
flow does look nicer. I'll test it this evening and look at it some
more.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .