(6/20/12 3:53 AM), Minchan Kim wrote: > On 06/20/2012 03:32 PM, KOSAKI Motohiro wrote: > >> (6/20/12 2:12 AM), Minchan Kim wrote: >>> >>> Hi Aaditya, >>> >>> I want to discuss this problem on another thread. >>> >>> On 06/19/2012 10:18 PM, Aaditya Kumar wrote: >>>> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote: >>>>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote: >>>>> >>>>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote: >>>>>> >>>>>>>> >>>>>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep >>>>>>>> if node has multiple zones. Hm ok, I realized my descriptions was >>>>>>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls >>>>>>>> pgdat_balanced() >>>>>>>> every priority. Most easy case is, movable zone has a lot of free pages and >>>>>>>> normal zone has no reclaimable page. >>>>>>>> >>>>>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should >>>>>>>> sleep only if every zones have much free pages than high water mark >>>>>>>> _and_ 25% of present pages in node are free. >>>>>>>> >>>>>>> >>>>>>> >>>>>>> Sorry. I can't understand your point. >>>>>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark. >>>>>>> It seems I am missing your point. >>>>>>> Please anybody correct me. >>>>>> >>>>>> Since currently direct reclaim is given up based on >>>>>> zone->all_unreclaimable flag, >>>>>> so for e.g in one of the scenarios: >>>>>> >>>>>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we >>>>>> hot-remove the all the pages of the MOVABLE zone. >>>>>> >>>>>> While migrating pages during memory hot-unplugging, the allocation function >>>>>> (for new page to which the page in MOVABLE zone would be moved) can end up >>>>>> looping in direct reclaim path for ever. >>>>>> >>>>>> This is so because when most of the pages in the MOVABLE zone have >>>>>> been migrated, >>>>>> the zone now contains lots of free memory (basically above low watermark) >>>>>> BUT all are in MIGRATE_ISOLATE list of the buddy list. >>>>>> >>>>>> So kswapd() would not balance this zone as free pages are above low watermark >>>>>> (but all are in isolate list). So zone->all_unreclaimable flag would >>>>>> never be set for this zone >>>>>> and allocation function would end up looping forever. (assuming the >>>>>> zone NORMAL is >>>>>> left with no reclaimable memory) >>>>>> >>>>> >>>>> >>>>> Thanks a lot, Aaditya! Scenario you mentioned makes perfect. >>>>> But I don't see it's a problem of kswapd. >>>> >>>> Hi Kim, >>> >>> I like called Minchan rather than Kim >>> Never mind. :) >>> >>>> >>>> Yes I agree it is not a problem of kswapd. >>> >>> Yeb. >>> >>>> >>>>> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list >>>>> but we can't allocate it. :( >>>>> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation. >>>>> Kswapd is just one of them confused. >>>>> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too. >>>>> >>>>> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free] >>>> >>>> >>>> I assume that by the inconsistency you mention above, you mean >>>> temporary inconsistency. >>>> >>>> Sorry, but IMHO as for memory hot plug the main issue with this patch >>>> is that the inconsistency you mentioned above would NOT be a temporary >>>> inconsistency. >>>> >>>> Every time say 'x' number of page frames are off lined, they will >>>> introduce a difference of 'x' pages between >>>> NR_FREE_PAGES and SumOf[free_area[order].nr_free]. >>>> (So for e.g. if we do a frequent offline/online it will make >>>> NR_FREE_PAGES negative) >>>> >>>> This is so because, unset_migratetype_isolate() is called from >>>> offlining code (to set the migrate type of off lined pages again back >>>> to MIGRATE_MOVABLE) >>>> after the pages have been off lined and removed from the buddy list. >>>> Since the pages for which unset_migratetype_isolate() is called are >>>> not buddy pages so move_freepages_block() does not move any page, and >>>> thus introducing a permanent inconsistency. >>> >>> Good point. Negative NR_FREE_PAGES is caused by double counting by my patch and __offline_isolated_pages. >>> I think at first MIGRATE_ISOLATE type freed page shouldn't account as free page. >>> >>>> >>>>> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect >>>>> free_area[order].nr_free exactly. >>>>> >>>>> Any thought? >>>> >>>> As for fixing move_freepages_block(), At least for memory hot plug, >>>> the pages stay in MIGRATE_ISOLATE list only for duration >>>> offline_pages() function, >>>> I mean only temporarily. Since fixing move_freepages_block() for will >>>> introduce some overhead, So I am not very sure whether that overhead >>>> is justified >>>> for a temporary condition. What do you think? >>> >>> Yes. I don't like hurt fast path, either. >>> How about this? (Passed just compile test :( ) >>> The patch's goal is to NOT increase nr_free and NR_FREE_PAGES about freed page into MIGRATE_ISOLATED. >>> >>> This patch hurts high order page free path but I think it's not critical because higher order allocation >>> is rare than order-0 allocation and we already have done same thing on free_hot_cold_page on order-0 free path >>> which is more hot. >> >> Can't we change zone_water_mark_ok_safe() instead of page allocator? memory hotplug is really rare event. > > > +1 > > Firstly, I want to make zone_page_state(z, NR_FREE_PAGES) itself more accurately because it is used by > several places. As I looked over places, I can't find critical places except kswapd forever sleep case. > So it's a nice idea! > > In that case, we need zone->lock whenever zone_watermark_ok_safe is called. > Most of cases, it's unnecessary and it might hurt alloc/free performance when memory pressure is high. > But if memory pressure is high, it may be already meaningless alloc/free performance. > So it does make sense, IMHO. > > Please raise your hands if anyone has a concern about this. > > barrios@bbox:~/linux-next$ git diff > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index d2a515d..82cc0a2 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1748,16 +1748,38 @@ bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, > zone_page_state(z, NR_FREE_PAGES)); > } > > -bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > +bool zone_watermark_ok_safe(struct zone *z, int alloc_order, unsigned long mark, > int classzone_idx, int alloc_flags) > { > + struct free_area *area; > + struct list_head *curr; > + int order; > + unsigned long flags; > long free_pages = zone_page_state(z, NR_FREE_PAGES); > > if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) > free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); > > - return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, > - free_pages); > + /* > + * Memory hotplug/CMA can isolate freed page into MIGRATE_ISOLATE > + * so that buddy can't allocate it although they are in free list. > + */ > + spin_lock_irqsave(&z->lock, flags); > + for (order = 0; order < MAX_ORDER; order++) { > + int count = 0; > + area = &(z->free_area[order]); > + if (unlikely(!list_empty(&area->free_list[MIGRATE_ISOLATE]))) { > + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE]) > + count++; > + free_pages -= (count << order); > + } > + } > + if (free_pages < 0) > + free_pages = 0; > + spin_unlock_irqrestore(&z->lock, flags); > + > + return __zone_watermark_ok(z, alloc_order, mark, > + classzone_idx, alloc_flags, free_pages); > } number of isolate page block is almost always 0. then if we have such counter, we almost always can avoid zone->lock. Just idea. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>