On Mon, Jun 25, 2012 at 01:59:27PM +0900, Minchan Kim wrote:
> When hotplug offlining happens on zone A, it starts to mark freed pages
> as MIGRATE_ISOLATE type in buddy for preventing further allocation.
> (MIGRATE_ISOLATE is a very ironic type because the pages are apparently
> in buddy but we can't allocate them.)
> When a memory shortage happens during hotplug offlining,
> the current task starts to reclaim, then wakes up kswapd.
> Kswapd checks the watermark, then goes to sleep because the current
> zone_watermark_ok_safe doesn't consider the MIGRATE_ISOLATE freed page
> count. The current task continues to reclaim in the direct reclaim path
> without kswapd's help. The problem is that zone->all_unreclaimable is
> set only by kswapd, so the current task would loop forever like below.
>
>	__alloc_pages_slowpath
>	restart:
>		wake_all_kswapd
>	rebalance:
>		__alloc_pages_direct_reclaim
>			do_try_to_free_pages
>				if global_reclaim && !all_unreclaimable
>					return 1; /* It means we did did_some_progress */
>		skip __alloc_pages_may_oom
>		should_alloc_retry
>			goto rebalance;
>
> If we apply KOSAKI's patch[1], which doesn't depend on kswapd for
> setting zone->all_unreclaimable, we can solve this problem by killing
> some task in the direct reclaim path. But it still doesn't wake up
> kswapd, which could remain a problem if another subsystem needs a
> GFP_ATOMIC request. So kswapd should consider MIGRATE_ISOLATE when it
> calculates free pages BEFORE going to sleep.
>
> This patch counts the number of MIGRATE_ISOLATE pageblocks and
> zone_watermark_ok_safe will take it into account if the system has such
> blocks (fortunately, it's very rare, so there is no overhead concern,
> and kswapd is never a hot path).
>
> [1] http://lkml.org/lkml/2012/6/14/74
>

I have not been following this discussion at all but reading through
the patch I wonder again why memory hotplug is not "allocating" the
pageblocks when they are fully isolated, like a balloon driver does.
This would keep the free space accounting as it is currently without
introducing something memory-hotplug-specific to kswapd.

I think historically memory hotplug did not allocate pages because it
would be difficult to detect whether a pageblock was isolated or part
of some balloon. Allocating just full pageblocks would work around
this. However, it would play very badly with CMA. It'd be worth
mentioning this in the changelog in case someone tries to "fix" it.

> Suggested-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> Cc: Aaditya Kumar <aaditya.kumar.30@xxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> ---
>
> Aaditya, could you confirm this patch solves your problem and
> make sure nr_migrate_isolate is zero after hotplug ends?
>
> Thanks!
>
>  include/linux/mmzone.h |    8 ++++++++
>  mm/page_alloc.c        |   36 ++++++++++++++++++++++++++++++++++++
>  mm/page_isolation.c    |   43 +++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 85 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bf3404e..290e186 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -474,6 +474,14 @@ struct zone {
>  	 * rarely used fields:
>  	 */
>  	const char	*name;
> +#ifdef CONFIG_MEMORY_ISOLATION
> +	/*
> +	 * the number of MIGRATE_ISOLATE pageblock
> +	 * We need this for accurate free page counting.
> +	 * It's protected by zone->lock
> +	 */
> +	atomic_t	nr_migrate_isolate;
> +#endif

If it's protected by the zone->lock then it does not need to be atomic.
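For illustration, a minimal sketch of the non-atomic variant follows.
The nr_pageblock_isolate field name matches the sketches later in this
mail; the helper is hypothetical and not part of the patch, and
assert_spin_locked merely documents the zone->lock requirement:

#ifdef CONFIG_MEMORY_ISOLATION
	/*
	 * Number of MIGRATE_ISOLATE pageblocks in this zone.
	 * Protected by zone->lock, so a plain counter is enough.
	 */
	unsigned long nr_pageblock_isolate;
#endif

static void zone_inc_nr_pageblock_isolate(struct zone *zone)
{
	/* Caller already holds zone->lock; no atomics required. */
	assert_spin_locked(&zone->lock);
	zone->nr_pageblock_isolate++;
}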
>  } ____cacheline_internodealigned_in_smp;
>
>  typedef enum {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c175fa9..626f877 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -218,6 +218,11 @@ EXPORT_SYMBOL(nr_online_nodes);
>
>  int page_group_by_mobility_disabled __read_mostly;
>
> +/*
> + * NOTE:
> + * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) direclty.

s/direclty/directly/

> + * Instead, use {un}set_pageblock_isolate.
> + */
>  void set_pageblock_migratetype(struct page *page, int migratetype)
>  {
>  	if (unlikely(page_group_by_mobility_disabled))
> @@ -1614,6 +1619,28 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>  	return true;
>  }
>
> +static unsigned long migrate_isolate_pages(struct zone *zone)
> +{

This name is very misleading as nothing is being migrated. The other
zone-based counters are stored in vmstat[] and accessed with
zone_page_state(). I doubt you want to use vmstat, but
nr_isolated_zone_pages() would have been a better name.

> +	unsigned long nr_pages = 0;
> +
> +	if (unlikely(atomic_read(&zone->nr_migrate_isolate))) {
> +		unsigned long flags;
> +		int order;
> +		spin_lock_irqsave(&zone->lock, flags);
> +		for (order = 0; order < MAX_ORDER; order++) {
> +			struct free_area *area = &zone->free_area[order];
> +			long count = 0;
> +			struct list_head *curr;
> +
> +			list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
> +				count++;
> +			nr_pages += (count << order);
> +		}
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +	}

We have a zone->nr_migrate_isolate counter but have to search the buddy
lists to count how many pages are isolated? Don't bother. If the
pageblocks really have been isolated, just assume that they are fully
isolated for the purposes of figuring out whether kswapd should wake up
or not. The consequence of an inaccurate count is that kswapd wakes up
when it potentially could have stayed asleep, but for memory hotplug
that is desirable.

> +	return nr_pages;
> +}
> +
>  bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>  		      int classzone_idx, int alloc_flags)
>  {
> @@ -1629,6 +1656,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
>  	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>  		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>
> +	/*
> +	 * If the zone has MIGRATE_ISOLATE type free page,
> +	 * we should consider it, too. Otherwise, kswapd can sleep forever.
> +	 */
> +	free_pages -= migrate_isolate_pages(z);
> +	if (free_pages < 0)
> +		free_pages = 0;
> +

You are already taking into account that the number of isolated pages
may be inaccurate, so an exact count in migrate_isolate_pages is
unnecessary.
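For illustration, the cheap estimate could be as simple as the sketch
below. The helper name and the plain nr_pageblock_isolate counter are
assumptions following the suggestions above, not the patch as posted:

static unsigned long nr_isolated_zone_pages(struct zone *zone)
{
	/*
	 * Assume every isolated pageblock is entirely free. This can
	 * only overestimate the number of isolated free pages, which
	 * at worst wakes kswapd when it could have stayed asleep;
	 * during memory hotplug that is exactly what we want.
	 */
	return zone->nr_pageblock_isolate << pageblock_order;
}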
>  	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
>  							free_pages);
>  }
> @@ -4407,6 +4442,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  		lruvec_init(&zone->lruvec, zone);
>  		zap_zone_vm_stats(zone);
>  		zone->flags = 0;
> +		atomic_set(&zone->nr_migrate_isolate, 0);
>  		if (!size)
>  			continue;
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 1a9cb36..e95a792 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -8,6 +8,45 @@
>  #include <linux/memory.h>
>  #include "internal.h"
>
> +static void set_pageblock_isolate(struct zone *zone, struct page *page)
> +{
> +	int old_migratetype;
> +	assert_spin_locked(&zone->lock);
> +
> +	if (unlikely(page_group_by_mobility_disabled)) {
> +		set_pageblock_flags_group(page, MIGRATE_UNMOVABLE,
> +				PB_migrate, PB_migrate_end);
> +		return;
> +	}

Not sure why this is necessary.

> +
> +	old_migratetype = get_pageblock_migratetype(page);
> +	set_pageblock_flags_group(page, MIGRATE_ISOLATE,
> +			PB_migrate, PB_migrate_end);
> +
> +	if (old_migratetype != MIGRATE_ISOLATE)
> +		atomic_inc(&zone->nr_migrate_isolate);
> +}

If the old type was MIGRATE_ISOLATE then it was also unnecessary to call
set_pageblock_flags_group() or anything else. It's also unnecessary to
pass in zone because you can figure it out from the page, but no harm.
This could have been a lot easier to read with something like:

static void set_pageblock_isolate(struct zone *zone, struct page *page)
{
	BUG_ON(page_zone(page) != zone);

	if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE)
		return;

	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
	zone->nr_pageblock_isolate++;
}

If set_migratetype_isolate is the only caller then it barely warrants a
helper function because it simply looks like

	if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) {
		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
		zone->nr_pageblock_isolate++;
	}

> +
> +static void unset_pageblock_isolate(struct zone *zone, struct page *page,
> +		unsigned long migratetype)
> +{

The word "unset" in this context would normally refer to a boolean but
you're actually restoring an old value. Hence reset_pageblock_isolate
or restore_pageblock_migratetype would have been more appropriate.

Oh, I see you are matching the naming of unset_migratetype_isolate().
That sucks; hope it was not me that suggested that name originally :/

migratetype is almost always int too, not sure where that unsigned long
came out of.
> +	assert_spin_locked(&zone->lock);
> +
> +	if (unlikely(page_group_by_mobility_disabled)) {
> +		set_pageblock_flags_group(page, migratetype,
> +				PB_migrate, PB_migrate_end);
> +		return;
> +	}
> +
> +	BUG_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
> +	BUG_ON(migratetype == MIGRATE_ISOLATE);
> +
> +	set_pageblock_flags_group(page, migratetype,
> +			PB_migrate, PB_migrate_end);
> +	atomic_dec(&zone->nr_migrate_isolate);
> +	BUG_ON(atomic_read(&zone->nr_migrate_isolate) < 0);
> +}
> +

Again, something like this would have been easier to read:

static void reset_pageblock_isolate(struct zone *zone, struct page *page,
		int migratetype)
{
	if (WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE))
		return;
	BUG_ON(zone->nr_pageblock_isolate == 0);

	set_pageblock_migratetype(page, migratetype);
	zone->nr_pageblock_isolate--;
}

>  int set_migratetype_isolate(struct page *page)
>  {
>  	struct zone *zone;
> @@ -54,7 +93,7 @@ int set_migratetype_isolate(struct page *page)
>
>  out:
>  	if (!ret) {
> -		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> +		set_pageblock_isolate(zone, page);
>  		move_freepages_block(zone, page, MIGRATE_ISOLATE);
>  	}
>
> @@ -72,8 +111,8 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>  	spin_lock_irqsave(&zone->lock, flags);
>  	if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
>  		goto out;
> -	set_pageblock_migratetype(page, migratetype);
>  	move_freepages_block(zone, page, migratetype);
> +	unset_pageblock_isolate(zone, page, migratetype);
>  out:
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  }

While this patch looks like it would work as advertised, I also think
that it is more complicated than it needs to be. The use of atomics and
exact counts of isolated pages looks unnecessary.
set_migratetype_isolate and unset_pageblock_isolate could have been a
lot easier to read.

-- 
Mel Gorman
SUSE Labs