On 06/21/2012 05:19 AM, KOSAKI Motohiro wrote: > (6/20/12 3:53 AM), Minchan Kim wrote: >> On 06/20/2012 03:32 PM, KOSAKI Motohiro wrote: >> >>> (6/20/12 2:12 AM), Minchan Kim wrote: >>>> >>>> Hi Aaditya, >>>> >>>> I want to discuss this problem on another thread. >>>> >>>> On 06/19/2012 10:18 PM, Aaditya Kumar wrote: >>>>> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote: >>>>>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote: >>>>>> >>>>>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote: >>>>>>> >>>>>>>>> >>>>>>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep >>>>>>>>> if node has multiple zones. Hm ok, I realized my descriptions was >>>>>>>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls >>>>>>>>> pgdat_balanced() >>>>>>>>> every priority. Most easy case is, movable zone has a lot of free pages and >>>>>>>>> normal zone has no reclaimable page. >>>>>>>>> >>>>>>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should >>>>>>>>> sleep only if every zones have much free pages than high water mark >>>>>>>>> _and_ 25% of present pages in node are free. >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Sorry. I can't understand your point. >>>>>>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark. >>>>>>>> It seems I am missing your point. >>>>>>>> Please anybody correct me. >>>>>>> >>>>>>> Since currently direct reclaim is given up based on >>>>>>> zone->all_unreclaimable flag, >>>>>>> so for e.g in one of the scenarios: >>>>>>> >>>>>>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we >>>>>>> hot-remove the all the pages of the MOVABLE zone. >>>>>>> >>>>>>> While migrating pages during memory hot-unplugging, the allocation function >>>>>>> (for new page to which the page in MOVABLE zone would be moved) can end up >>>>>>> looping in direct reclaim path for ever. >>>>>>> >>>>>>> This is so because when most of the pages in the MOVABLE zone have >>>>>>> been migrated, >>>>>>> the zone now contains lots of free memory (basically above low watermark) >>>>>>> BUT all are in MIGRATE_ISOLATE list of the buddy list. >>>>>>> >>>>>>> So kswapd() would not balance this zone as free pages are above low watermark >>>>>>> (but all are in isolate list). So zone->all_unreclaimable flag would >>>>>>> never be set for this zone >>>>>>> and allocation function would end up looping forever. (assuming the >>>>>>> zone NORMAL is >>>>>>> left with no reclaimable memory) >>>>>>> >>>>>> >>>>>> >>>>>> Thanks a lot, Aaditya! Scenario you mentioned makes perfect. >>>>>> But I don't see it's a problem of kswapd. >>>>> >>>>> Hi Kim, >>>> >>>> I like called Minchan rather than Kim >>>> Never mind. :) >>>> >>>>> >>>>> Yes I agree it is not a problem of kswapd. >>>> >>>> Yeb. >>>> >>>>> >>>>>> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list >>>>>> but we can't allocate it. :( >>>>>> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation. >>>>>> Kswapd is just one of them confused. >>>>>> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too. >>>>>> >>>>>> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free] >>>>> >>>>> >>>>> I assume that by the inconsistency you mention above, you mean >>>>> temporary inconsistency. >>>>> >>>>> Sorry, but IMHO as for memory hot plug the main issue with this patch >>>>> is that the inconsistency you mentioned above would NOT be a temporary >>>>> inconsistency. >>>>> >>>>> Every time say 'x' number of page frames are off lined, they will >>>>> introduce a difference of 'x' pages between >>>>> NR_FREE_PAGES and SumOf[free_area[order].nr_free]. >>>>> (So for e.g. if we do a frequent offline/online it will make >>>>> NR_FREE_PAGES negative) >>>>> >>>>> This is so because, unset_migratetype_isolate() is called from >>>>> offlining code (to set the migrate type of off lined pages again back >>>>> to MIGRATE_MOVABLE) >>>>> after the pages have been off lined and removed from the buddy list. >>>>> Since the pages for which unset_migratetype_isolate() is called are >>>>> not buddy pages so move_freepages_block() does not move any page, and >>>>> thus introducing a permanent inconsistency. >>>> >>>> Good point. Negative NR_FREE_PAGES is caused by double counting by my patch and __offline_isolated_pages. >>>> I think at first MIGRATE_ISOLATE type freed page shouldn't account as free page. >>>> >>>>> >>>>>> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect >>>>>> free_area[order].nr_free exactly. >>>>>> >>>>>> Any thought? >>>>> >>>>> As for fixing move_freepages_block(), At least for memory hot plug, >>>>> the pages stay in MIGRATE_ISOLATE list only for duration >>>>> offline_pages() function, >>>>> I mean only temporarily. Since fixing move_freepages_block() for will >>>>> introduce some overhead, So I am not very sure whether that overhead >>>>> is justified >>>>> for a temporary condition. What do you think? >>>> >>>> Yes. I don't like hurt fast path, either. >>>> How about this? (Passed just compile test :( ) >>>> The patch's goal is to NOT increase nr_free and NR_FREE_PAGES about freed page into MIGRATE_ISOLATED. >>>> >>>> This patch hurts high order page free path but I think it's not critical because higher order allocation >>>> is rare than order-0 allocation and we already have done same thing on free_hot_cold_page on order-0 free path >>>> which is more hot. >>> >>> Can't we change zone_water_mark_ok_safe() instead of page allocator? memory hotplug is really rare event. >> >> >> +1 >> >> Firstly, I want to make zone_page_state(z, NR_FREE_PAGES) itself more accurately because it is used by >> several places. As I looked over places, I can't find critical places except kswapd forever sleep case. >> So it's a nice idea! >> >> In that case, we need zone->lock whenever zone_watermark_ok_safe is called. >> Most of cases, it's unnecessary and it might hurt alloc/free performance when memory pressure is high. >> But if memory pressure is high, it may be already meaningless alloc/free performance. >> So it does make sense, IMHO. >> >> Please raise your hands if anyone has a concern about this. >> >> barrios@bbox:~/linux-next$ git diff >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index d2a515d..82cc0a2 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -1748,16 +1748,38 @@ bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, >> zone_page_state(z, NR_FREE_PAGES)); >> } >> >> -bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, >> +bool zone_watermark_ok_safe(struct zone *z, int alloc_order, unsigned long mark, >> int classzone_idx, int alloc_flags) >> { >> + struct free_area *area; >> + struct list_head *curr; >> + int order; >> + unsigned long flags; >> long free_pages = zone_page_state(z, NR_FREE_PAGES); >> >> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) >> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); >> >> - return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, >> - free_pages); >> + /* >> + * Memory hotplug/CMA can isolate freed page into MIGRATE_ISOLATE >> + * so that buddy can't allocate it although they are in free list. >> + */ >> + spin_lock_irqsave(&z->lock, flags); >> + for (order = 0; order < MAX_ORDER; order++) { >> + int count = 0; >> + area = &(z->free_area[order]); >> + if (unlikely(!list_empty(&area->free_list[MIGRATE_ISOLATE]))) { >> + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE]) >> + count++; >> + free_pages -= (count << order); >> + } >> + } >> + if (free_pages < 0) >> + free_pages = 0; >> + spin_unlock_irqrestore(&z->lock, flags); >> + >> + return __zone_watermark_ok(z, alloc_order, mark, >> + classzone_idx, alloc_flags, free_pages); >> } > > number of isolate page block is almost always 0. then if we have such counter, > we almost always can avoid zone->lock. Just idea. Yeb. I thought about it but unfortunately we can't have a counter for MIGRATE_ISOLATE. Because we have to tweak in page free path for pages which are going to free later after we mark pageblock type to MIGRATE_ISOLATE. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>