Hi Aaditya,

I want to discuss this problem on another thread.

On 06/19/2012 10:18 PM, Aaditya Kumar wrote:
> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>
>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>>>
>>>>> pgdat_balanced() doesn't recognize individual zones, so kswapd may
>>>>> sleep if the node has multiple zones. Hm, OK, I realize my description
>>>>> was slightly misleading: priority 0 is not needed. balance_pgdat()
>>>>> calls pgdat_balanced() at every priority. The easiest case is when the
>>>>> movable zone has a lot of free pages and the normal zone has no
>>>>> reclaimable pages.
>>>>>
>>>>> btw, the current pgdat_balanced() logic seems incorrect. kswapd should
>>>>> sleep only if every zone has free pages well above the high watermark
>>>>> _and_ 25% of the present pages in the node are free.
>>>>
>>>> Sorry, I can't understand your point.
>>>> The current kswapd doesn't sleep if the relevant zones don't have free
>>>> pages above the high watermark.
>>>> It seems I am missing your point. Please, anybody, correct me.
>>>
>>> Since direct reclaim currently gives up based on the
>>> zone->all_unreclaimable flag, consider for example this scenario:
>>>
>>> Let's say the system has one node with two zones (NORMAL and MOVABLE),
>>> and we hot-remove all the pages of the MOVABLE zone.
>>>
>>> While migrating pages during memory hot-unplugging, the allocation
>>> function (for the new page to which a page in the MOVABLE zone will be
>>> moved) can end up looping in the direct reclaim path forever.
>>>
>>> This is because once most of the pages in the MOVABLE zone have been
>>> migrated, the zone contains lots of free memory (basically above the
>>> low watermark), BUT all of it sits on the MIGRATE_ISOLATE free list of
>>> the buddy allocator.
>>>
>>> So kswapd() will not balance this zone, since its free pages are above
>>> the low watermark (but all on the isolate list), zone->all_unreclaimable
>>> will never be set for this zone, and the allocation function ends up
>>> looping forever (assuming the NORMAL zone is left with no reclaimable
>>> memory).
>>
>> Thanks a lot, Aaditya! The scenario you describe makes perfect sense.
>> But I don't see it as a problem of kswapd.
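To make the failure mode concrete, here is a toy userspace model of the two
disagreeing views of "free". This is not the real mm/page_alloc.c logic; all
names and numbers below are invented for illustration:

#include <stdbool.h>
#include <stdio.h>

struct toy_zone {
	long nr_free_pages;	/* models NR_FREE_PAGES */
	long nr_isolated_free;	/* free pages on MIGRATE_ISOLATE lists */
	long high_wmark;	/* models the high watermark */
};

/* kswapd's view: the counter includes unallocatable isolated pages */
static bool looks_balanced(const struct toy_zone *z)
{
	return z->nr_free_pages > z->high_wmark;
}

/* the allocator's view: isolated pages are off limits */
static bool can_allocate(const struct toy_zone *z)
{
	return z->nr_free_pages - z->nr_isolated_free > 0;
}

int main(void)
{
	/* MOVABLE zone after hot-unplug migration: every free page isolated */
	struct toy_zone z = {
		.nr_free_pages = 1024,
		.nr_isolated_free = 1024,
		.high_wmark = 128,
	};

	/*
	 * Prints "balanced=1 allocatable=0": the zone looks balanced, so
	 * kswapd sleeps and all_unreclaimable is never set, yet no
	 * allocation can succeed, so direct reclaim retries forever.
	 */
	printf("balanced=%d allocatable=%d\n",
	       looks_balanced(&z), can_allocate(&z));
	return 0;
}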
> Hi Kim,

I'd like to be called Minchan rather than Kim. Never mind. :)

> Yes, I agree it is not a problem of kswapd.

Yeb.

>> a5d76b54 introduced the new migration type MIGRATE_ISOLATE, which is a
>> very ironic type: there are many free pages on the free list, but we
>> can't allocate them. :(
>> It doesn't keep NR_FREE_PAGES accurate, while many places in the kernel
>> use NR_FREE_PAGES to trigger some operation.
>> Kswapd is just one of the users it confuses.
>> IMHO, the right fix for this problem is to fix the hotplug code, which
>> could fix CMA, too.
>>
>> This patch could introduce an inconsistency between NR_FREE_PAGES and
>> SumOf[free_area[order].nr_free]
>
> I assume that by the inconsistency you mention above, you mean a
> temporary inconsistency.
>
> Sorry, but IMHO, as far as memory hotplug is concerned, the main issue
> with this patch is that the inconsistency would NOT be temporary.
>
> Every time, say, 'x' page frames are offlined, they introduce a
> difference of 'x' pages between NR_FREE_PAGES and
> SumOf[free_area[order].nr_free].
> (So, for example, frequent offline/online will eventually make
> NR_FREE_PAGES negative.)
>
> This is because unset_migratetype_isolate() is called from the offlining
> code (to set the migratetype of the offlined pages back to
> MIGRATE_MOVABLE) after the pages have been offlined and removed from the
> buddy list.
> Since the pages for which unset_migratetype_isolate() is called are no
> longer buddy pages, move_freepages_block() does not move any page, thus
> introducing a permanent inconsistency.

Good point. The negative NR_FREE_PAGES comes from double counting, by my
patch and by __offline_isolated_pages(). I think that, first of all, a page
freed as MIGRATE_ISOLATE shouldn't be accounted as a free page.
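To spell out the arithmetic of that double counting, here is another toy
userspace model (not kernel code; 'x' and the starting value are invented):

#include <stdio.h>

int main(void)
{
	long nr_free_pages = 1024;	/* models NR_FREE_PAGES */
	const long x = 256;		/* page frames offlined per cycle */

	for (int cycle = 1; cycle <= 5; cycle++) {
		/*
		 * Pages freed while MIGRATE_ISOLATE: the free path skips
		 * the "nr_free_pages += x" they would normally get...
		 */

		/* ...but __offline_isolated_pages() still subtracts them. */
		nr_free_pages -= x;

		/*
		 * Prints 768, 512, 256, 0, -256: each unmatched
		 * subtraction leaks x pages from the counter.
		 */
		printf("after cycle %d: NR_FREE_PAGES = %ld\n",
		       cycle, nr_free_pages);
	}
	return 0;
}

Each cycle the subtraction has no matching addition, which is exactly the
permanent drift described above.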
>
>> and it could confuse __zone_watermark_ok(), so we might need to fix
>> move_freepages_block() itself to keep free_area[order].nr_free exact.
>>
>> Any thoughts?
>
> As for fixing move_freepages_block(): at least for memory hotplug, the
> pages stay on the MIGRATE_ISOLATE list only for the duration of the
> offline_pages() function, I mean only temporarily. Since fixing
> move_freepages_block() will introduce some overhead, I am not very sure
> that overhead is justified for a temporary condition. What do you think?

Yes, I don't like hurting the fast path, either. How about this? (It has
only passed a compile test. :( )

The patch's goal is to NOT increase nr_free and NR_FREE_PAGES for pages
freed as MIGRATE_ISOLATE. This patch hurts the high-order page free path,
but I don't think that's critical, because high-order allocation is rarer
than order-0 allocation, and we already do the same thing in
free_hot_cold_page() on the order-0 free path, which is hotter.

The inline patch below may be whitespace-damaged; I can't inline code
properly at the office, sorry. Instead, I will attach the patch.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..d2a515d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -676,6 +676,24 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock(&zone->lock);
 }
 
+/*
+ * This function is almost the same as free_one_page() except that it
+ * doesn't increase NR_FREE_PAGES and free_area[order].nr_free,
+ * because the page allocator can't allocate MIGRATE_ISOLATE pages.
+ *
+ * Caller should hold zone->lock.
+ */
+static void free_one_isolated_page(struct zone *zone, struct page *page,
+				int order)
+{
+	zone->all_unreclaimable = 0;
+	zone->pages_scanned = 0;
+
+	__free_one_page(page, zone, order, MIGRATE_ISOLATE);
+	/* rollback nr_free increased by __free_one_page */
+	zone->free_area[order].nr_free--;
+}
+
 static void free_one_page(struct zone *zone, struct page *page, int order,
 				int migratetype)
 {
@@ -683,6 +701,13 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
+	/*
+	 * A freed MIGRATE_ISOLATE page should go through the
+	 * free_one_isolated_page() path, which doesn't increase
+	 * NR_FREE_PAGES and free_area[order].nr_free.
+	 */
+	VM_BUG_ON(migratetype == MIGRATE_ISOLATE);
+
 	__free_one_page(page, zone, order, migratetype);
 	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	spin_unlock(&zone->lock);
@@ -718,6 +743,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
 	int wasMlocked = __TestClearPageMlocked(page);
+	int migratetype;
 
 	if (!free_pages_prepare(page, order))
 		return;
@@ -726,8 +752,21 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	if (unlikely(wasMlocked))
 		free_page_mlock(page);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, order,
-					get_pageblock_migratetype(page));
+	migratetype = get_pageblock_migratetype(page);
+	/*
+	 * High-order page alloc/free is rare compared to
+	 * order-0, so this check should not be critical
+	 * for performance.
+	 */
+	if (unlikely(migratetype == MIGRATE_ISOLATE)) {
+		struct zone *zone = page_zone(page);
+		spin_lock(&zone->lock);
+		free_one_isolated_page(zone, page, order);
+		spin_unlock(&zone->lock);
+	}
+	else {
+		free_one_page(page_zone(page), page, order, migratetype);
+	}
 	local_irq_restore(flags);
 }
 
@@ -906,6 +945,55 @@ static int fallbacks[MIGRATE_TYPES][4] = {
 	[MIGRATE_ISOLATE] = { MIGRATE_RESERVE }, /* Never used */
 };
 
+static int hotplug_move_freepages(struct zone *zone,
+			  struct page *start_page, struct page *end_page,
+			  int from_migratetype, int to_migratetype)
+{
+	struct page *page;
+	unsigned long order;
+	int pages_moved = 0;
+
+#ifndef CONFIG_HOLES_IN_ZONE
+	/*
+	 * page_zone is not safe to call in this context when
+	 * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
+	 * anyway as we check zone boundaries in move_freepages_block().
+	 * Remove at a later date when no bug reports exist related to
+	 * grouping pages by mobility
+	 */
+	BUG_ON(page_zone(start_page) != page_zone(end_page));
+#endif
+
+	BUG_ON(from_migratetype == to_migratetype);
+
+	for (page = start_page; page <= end_page;) {
+		/* Make sure we are not inadvertently changing nodes */
+		VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));
+
+		if (!pfn_valid_within(page_to_pfn(page))) {
+			page++;
+			continue;
+		}
+
+		if (!PageBuddy(page)) {
+			page++;
+			continue;
+		}
+
+		order = page_order(page);
+		list_move(&page->lru,
+			  &zone->free_area[order].free_list[to_migratetype]);
+		if (to_migratetype == MIGRATE_ISOLATE)
+			zone->free_area[order].nr_free--;
+		else if (from_migratetype == MIGRATE_ISOLATE)
+			zone->free_area[order].nr_free++;
+		page += 1 << order;
+		pages_moved += 1 << order;
+	}
+
+	return pages_moved;
+}
+
 /*
  * Move the free pages in a range to the free lists of the requested type.
  * Note that start_page and end_pages are not aligned on a pageblock
@@ -954,6 +1042,32 @@ static int move_freepages(struct zone *zone,
 	return pages_moved;
 }
 
+/*
+ * Almost the same as move_freepages_block(), except for the explicit
+ * [from, to] migratetypes; we need it to account
+ * zone->free_area[order].nr_free exactly.
+ */
+static int hotplug_move_freepages_block(struct zone *zone, struct page *page,
+				int from_migratetype, int to_migratetype)
+{
+	unsigned long start_pfn, end_pfn;
+	struct page *start_page, *end_page;
+
+	start_pfn = page_to_pfn(page);
+	start_pfn = start_pfn & ~(pageblock_nr_pages-1);
+	start_page = pfn_to_page(start_pfn);
+	end_page = start_page + pageblock_nr_pages - 1;
+	end_pfn = start_pfn + pageblock_nr_pages - 1;
+
+	/* Do not cross zone boundaries */
+	if (start_pfn < zone->zone_start_pfn)
+		start_page = page;
+	if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
+		return 0;
+
+	return hotplug_move_freepages(zone, start_page, end_page,
+				from_migratetype, to_migratetype);
+}
+
 static int move_freepages_block(struct zone *zone, struct page *page,
 				int migratetype)
 {
@@ -1311,7 +1425,9 @@ void free_hot_cold_page(struct page *page, int cold)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(migratetype == MIGRATE_ISOLATE)) {
-			free_one_page(zone, page, 0, migratetype);
+			spin_lock(&zone->lock);
+			free_one_isolated_page(zone, page, 0);
+			spin_unlock(&zone->lock);
 			goto out;
 		}
 		migratetype = MIGRATE_MOVABLE;
@@ -1388,6 +1504,7 @@ int split_free_page(struct page *page)
 	unsigned int order;
 	unsigned long watermark;
 	struct zone *zone;
+	int migratetype;
 
 	BUG_ON(!PageBuddy(page));
 
@@ -1400,10 +1517,17 @@ int split_free_page(struct page *page)
 		return 0;
 
 	/* Remove page from free list */
+	migratetype = get_pageblock_migratetype(page);
 	list_del(&page->lru);
-	zone->free_area[order].nr_free--;
+	/*
+	 * The page allocator didn't increase nr_free and NR_FREE_PAGES for
+	 * pages on free_area[order].free_list[MIGRATE_ISOLATE].
+	 */
+	if (migratetype != MIGRATE_ISOLATE) {
+		zone->free_area[order].nr_free--;
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+	}
 	rmv_page_order(page);
-	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
 
 	/* Split into individual pages */
 	set_page_refcounted(page);
@@ -5593,8 +5717,11 @@ int set_migratetype_isolate(struct page *page)
 
 out:
 	if (!ret) {
+		int pages_moved;
 		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
-		move_freepages_block(zone, page, MIGRATE_ISOLATE);
+		pages_moved = hotplug_move_freepages_block(zone, page,
+					MIGRATE_MOVABLE, MIGRATE_ISOLATE);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
 	}
 
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -5607,12 +5734,15 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 {
 	struct zone *zone;
 	unsigned long flags;
+	int pages_moved;
 	zone = page_zone(page);
 	spin_lock_irqsave(&zone->lock, flags);
 	if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
 		goto out;
 	set_pageblock_migratetype(page, migratetype);
-	move_freepages_block(zone, page, migratetype);
+	pages_moved = hotplug_move_freepages_block(zone, page,
+					MIGRATE_ISOLATE, migratetype);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -5900,9 +6030,6 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 #endif
 		list_del(&page->lru);
 		rmv_page_order(page);
-		zone->free_area[order].nr_free--;
-		__mod_zone_page_state(zone, NR_FREE_PAGES,
-				      - (1UL << order));
 		for (i = 0; i < (1 << order); i++)
 			SetPageReserved((page+i));
 		pfn += (1 << order);
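With the above, pages sitting on the MIGRATE_ISOLATE free lists are
invisible to both free_area[order].nr_free and NR_FREE_PAGES, so the two
counters should stay in agreement. A hypothetical debug helper (not part of
the patch) that spells out that invariant, ignoring per-cpu vmstat drift:

static void check_free_accounting(struct zone *zone)
{
	unsigned long nr_buddy = 0;
	unsigned int order;

	/* caller holds zone->lock */
	for (order = 0; order < MAX_ORDER; order++)
		nr_buddy += zone->free_area[order].nr_free << order;

	/*
	 * nr_free no longer counts MIGRATE_ISOLATE pages and neither does
	 * NR_FREE_PAGES, so the two should agree (modulo per-cpu vmstat
	 * deltas that haven't been folded in yet).
	 */
	WARN_ON(nr_buddy != zone_page_state(zone, NR_FREE_PAGES));
}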
-- 
Kind regards,
Minchan Kim