On 06/20/2012 09:44 PM, Hillf Danton wrote:
> On Wed, Jun 20, 2012 at 3:53 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>> On 06/20/2012 03:32 PM, KOSAKI Motohiro wrote:
>>
>>> (6/20/12 2:12 AM), Minchan Kim wrote:
>>>>
>>>> Hi Aaditya,
>>>>
>>>> I want to discuss this problem on another thread.
>>>>
>>>> On 06/19/2012 10:18 PM, Aaditya Kumar wrote:
>>>>> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>>>>>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>>>>>
>>>>>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>>>
>>>>>>>>> pgdat_balanced() doesn't recognize zones, so kswapd may sleep
>>>>>>>>> if a node has multiple zones. Hm, OK, I realize my description was
>>>>>>>>> slightly misleading. Priority 0 is not needed; balance_pgdat() calls
>>>>>>>>> pgdat_balanced() at every priority. The easiest case is when the
>>>>>>>>> movable zone has a lot of free pages and the normal zone has no
>>>>>>>>> reclaimable pages.
>>>>>>>>>
>>>>>>>>> Btw, the current pgdat_balanced() logic seems incorrect. kswapd should
>>>>>>>>> sleep only if every zone has more free pages than its high watermark
>>>>>>>>> _and_ 25% of the present pages in the node are free.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Sorry, I can't understand your point.
>>>>>>>> The current kswapd doesn't sleep if the relevant zones don't have free
>>>>>>>> pages above the high watermark. It seems I am missing your point.
>>>>>>>> Please, anybody, correct me.
>>>>>>>
>>>>>>> Since direct reclaim is currently given up based on the
>>>>>>> zone->all_unreclaimable flag, consider for example this scenario:
>>>>>>>
>>>>>>> Let's say the system has one node with two zones (NORMAL and MOVABLE) and
>>>>>>> we hot-remove all the pages of the MOVABLE zone.
>>>>>>>
>>>>>>> While migrating pages during memory hot-unplugging, the allocation function
>>>>>>> (for the new page to which a page in the MOVABLE zone will be moved) can
>>>>>>> end up looping in the direct reclaim path forever.
>>>>>>>
>>>>>>> This is because once most of the pages in the MOVABLE zone have been
>>>>>>> migrated, the zone contains lots of free memory (basically above the low
>>>>>>> watermark), BUT it all sits on the MIGRATE_ISOLATE list of the buddy
>>>>>>> allocator.
>>>>>>>
>>>>>>> So kswapd() will not balance this zone, since its free pages are above the
>>>>>>> low watermark (but all on the isolate list). Therefore the
>>>>>>> zone->all_unreclaimable flag will never be set for this zone, and the
>>>>>>> allocation function ends up looping forever (assuming the NORMAL zone is
>>>>>>> left with no reclaimable memory).
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks a lot, Aaditya! The scenario you mention makes perfect sense.
>>>>>> But I don't see it as a problem of kswapd.
>>>>>
>>>>> Hi Kim,
>>>>
>>>> I'd like to be called Minchan rather than Kim.
>>>> Never mind. :)
>>>>
>>>>>
>>>>> Yes, I agree it is not a problem of kswapd.
>>>>
>>>> Yeb.
>>>>
>>>>>
>>>>>> a5d76b54 introduced the new migration type MIGRATE_ISOLATE, which is a very
>>>>>> ironic type: there are many free pages on the free list, but we can't
>>>>>> allocate them. :(
>>>>>> It doesn't keep NR_FREE_PAGES accurate, while many places in the kernel use
>>>>>> NR_FREE_PAGES to trigger some operation. Kswapd is just one of the confused
>>>>>> ones. IMHO, the right fix for this problem is to fix the hotplug code,
>>>>>> which can fix CMA, too.
>>>>>>
>>>>>> This patch could make an inconsistency between NR_FREE_PAGES and
>>>>>> SumOf[free_area[order].nr_free],
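To make the inconsistency concrete: NR_FREE_PAGES is a per-zone vmstat counter,
while each free_area keeps its own nr_free count of free blocks of that order.
A simplified sketch, illustration only and not actual kernel code (the helper
name buddy_free_pages() is hypothetical; zone->lock should be held for a stable
snapshot), of the buddy-side sum that ought to match the vmstat counter:

	static unsigned long buddy_free_pages(struct zone *zone)
	{
		unsigned long sum = 0;
		unsigned int order;

		/* nr_free counts free blocks of size 2^order, so scale to pages */
		for (order = 0; order < MAX_ORDER; order++)
			sum += zone->free_area[order].nr_free << order;

		/* consistent iff sum == zone_page_state(zone, NR_FREE_PAGES) */
		return sum;
	}

The discussion below is about how these two counters drift apart.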
>>>>>
>>>>> I assume that by the inconsistency you mention above, you mean a
>>>>> temporary inconsistency.
>>>>>
>>>>> Sorry, but IMHO, as for memory hotplug the main issue with this patch
>>>>> is that the inconsistency you mentioned above would NOT be a temporary
>>>>> inconsistency.
>>>>>
>>>>> Every time, say, 'x' page frames are offlined, they introduce a
>>>>> difference of 'x' pages between NR_FREE_PAGES and
>>>>> SumOf[free_area[order].nr_free].
>>>>> (So, for example, frequent offline/online cycles will make
>>>>> NR_FREE_PAGES go negative.)
>>>>>
>>>>> This is because unset_migratetype_isolate() is called from the
>>>>> offlining code (to set the migrate type of offlined pages back to
>>>>> MIGRATE_MOVABLE) after the pages have been offlined and removed from
>>>>> the buddy list. Since the pages for which unset_migratetype_isolate()
>>>>> is called are not buddy pages, move_freepages_block() does not move
>>>>> any page, thus introducing a permanent inconsistency.
>>>>
>>>> Good point. The negative NR_FREE_PAGES is caused by double counting, by
>>>> my patch and by __offline_isolated_pages. I think a freed page of
>>>> MIGRATE_ISOLATE type shouldn't be accounted as a free page in the first
>>>> place.
>>>>
>>>>>
>>>>>> and it could confuse __zone_watermark_ok, so we might need to fix
>>>>>> move_freepages_block itself to reflect free_area[order].nr_free
>>>>>> exactly.
>>>>>>
>>>>>> Any thought?
>>>>>
>>>>> As for fixing move_freepages_block(): at least for memory hotplug, the
>>>>> pages stay on the MIGRATE_ISOLATE list only for the duration of the
>>>>> offline_pages() function, I mean only temporarily. Since fixing
>>>>> move_freepages_block() will introduce some overhead, I am not very sure
>>>>> whether that overhead is justified for a temporary condition. What do
>>>>> you think?
>>>>
>>>> Yes, I don't like hurting the fast path, either.
>>>> How about this? (Passed just a compile test. :( )
>>>> The patch's goal is to NOT increase nr_free and NR_FREE_PAGES for pages
>>>> freed into MIGRATE_ISOLATE.
>>>>
>>>> This patch hurts the high-order page free path, but I think that's not
>>>> critical because higher-order allocations are rarer than order-0
>>>> allocations, and we already do the same thing in free_hot_cold_page on
>>>> the order-0 free path, which is hotter.
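The core of that idea, as a rough sketch only (simplified and hypothetical, not
the actual patch; the real accounting is split between free_one_page() and
__free_one_page(), and the merge paths need handling, too), is to skip the
free-page accounting when a page lands on the MIGRATE_ISOLATE free list:

	/* in the order > 0 free path, after buddy merging */
	list_add(&page->lru,
		 &zone->free_area[order].free_list[migratetype]);
	if (likely(migratetype != MIGRATE_ISOLATE)) {
		zone->free_area[order].nr_free++;
		__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
	}

That way NR_FREE_PAGES and free_area[].nr_free never count pages the allocator
cannot actually hand out, at the cost of one extra branch in the free path.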
>>>
>>> Can't we change zone_watermark_ok_safe() instead of the page allocator?
>>> Memory hotplug is a really rare event.
>>
>>
>> +1
>>
>> Firstly, I wanted to make zone_page_state(z, NR_FREE_PAGES) itself more
>> accurate because it is used in several places. Looking over those places,
>> I couldn't find critical ones except the case where kswapd sleeps forever.
>> So it's a nice idea!
>>
>> In that case, we need zone->lock whenever zone_watermark_ok_safe is called.
>>
> Ifdefinery could be utilized for builds with CMA disabled, first.

Yeb. If the system uses neither CMA nor MEMORY_HOTPLUG, we can avoid it.
I can do that in a formal patch later.

#if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_HOTPLUG)

Still, my concern is that I'm not sure whether this approach is good or not.
As I mentioned, MIGRATE_ISOLATE is a very ironic type: it marks pages as NOT
allocatable while they sit on the _free_ list, and they even increase
NR_FREE_PAGES and the nr_free of the free_area. So if someone uses it in the
future, we have to add a new CONFIG like this:

#if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_XXX)

And if yet another user tries it:

#if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_XXX) || defined(CONFIG_YYY)

I know we can guess that the number of MIGRATE_ISOLATE users will stay small,
but who can make sure of that? The important thing is that we provide such an
interface, and any user who wants it can use it at any time.
If the users of NR_FREE_PAGES and nr_free grow, they will get even more
confused than now. Sigh.

IMHO, the right approach is that we shouldn't account such a page as free in
the first place, although that adds a new condition check to the higher-order
page free path. Look at the patch attached to my previous mail.

Of course, we could redesign hotplug's isolated-page machinery, but I have no
good idea for it. We could isolate pages which are already on the freelist
onto another list instead of the free_area, but the problem is that we can't
isolate pages which are going to be freed later, i.e. returned to the buddy
allocator after we mark the pageblock MIGRATE_ISOLATE. For that, we would
have to touch the page free path, which is a hot path, too.

>
>> In most cases it's unnecessary, and it might hurt alloc/free performance
>> when memory pressure is high. But if memory pressure is high, alloc/free
>> performance may already be meaningless.
>> So it does make sense, IMHO.
>>
>> Please raise your hands if anyone has a concern about this.
>>
>> barrios@bbox:~/linux-next$ git diff
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d2a515d..82cc0a2 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1748,16 +1748,38 @@ bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>>  					zone_page_state(z, NR_FREE_PAGES));
>>  }
>>
>> -bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
>> +bool zone_watermark_ok_safe(struct zone *z, int alloc_order, unsigned long mark,
>>  		      int classzone_idx, int alloc_flags)
>>  {
>> +	struct free_area *area;
>> +	struct list_head *curr;
>> +	int order;
>> +	unsigned long flags;
>>  	long free_pages = zone_page_state(z, NR_FREE_PAGES);
>>
>>  	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>>  		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>>
>> -	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
>> -			free_pages);
>> +	/*
>> +	 * Memory hotplug/CMA can isolate freed pages into MIGRATE_ISOLATE
>> +	 * so that buddy can't allocate them although they are in the free list.
>> +	 */
>> +	spin_lock_irqsave(&z->lock, flags);
>> +	for (order = 0; order < MAX_ORDER; order++) {
>> +		int count = 0;
>> +		area = &(z->free_area[order]);
>> +		if (unlikely(!list_empty(&area->free_list[MIGRATE_ISOLATE]))) {
>> +			list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
>> +				count++;
>> +			free_pages -= (count << order);
>> +		}
>> +	}
>> +	if (free_pages < 0)
>> +		free_pages = 0;
>> +	spin_unlock_irqrestore(&z->lock, flags);
>> +
>> +	return __zone_watermark_ok(z, alloc_order, mark,
>> +			classzone_idx, alloc_flags, free_pages);
>>  }
>>
> Then isolated pages could be scanned in another direction?
>
> 	spin_lock_irqsave(&z->lock, flags);
> 	for (order = MAX_ORDER - 1; order >= 0; order--) {
> 		struct free_area *area = &z->free_area[order];
> 		long count = 0;
> 		struct list_head *curr;
>
> 		list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
> 			count++;
>
> 		free_pages -= (count << order);
> 		if (free_pages < 0) {
> 			free_pages = 0;
> 			break;
> 		}
> 	}
> 	spin_unlock_irqrestore(&z->lock, flags);

I'm not sure how much it helps to reduce the loop iterations, but it's no
problem. Anyway, before sending a formal patch, I want to discuss my concern
above.

Thanks, Hillf.

--
Kind regards,
Minchan Kim