Re: [PATCH] vmscan: Fix do_try_to_free_pages() return value when priority==0 reclaim failure

Greg Thelen <gthelen@xxxxxxxxxx> · Tue, 01 Jun 2010 07:50:51 -0700

KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx> writes:
> CC to memcg folks.
>
>> I agree with the direction of this patch, but I am seeing a hang when
>> testing with mmotm-2010-05-21-16-05.  The following test hangs, unless I
>> remove this patch from mmotm:
>>   mount -t cgroup none /cgroups -o memory
>>   mkdir /cgroups/cg1
>>   echo $$ > /cgroups/cg1/tasks
>>   dd bs=1024 count=1024 if=/dev/null of=/data/foo
>>   echo $$ > /cgroups/tasks
>>   echo 1 > /cgroups/cg1/memory.force_empty
>> 
>> I think the hang is caused by the following portion of
>> mem_cgroup_force_empty():
>> 	while (nr_retries && mem->res.usage > 0) {
>> 		int progress;
>> 
>> 		if (signal_pending(current)) {
>> 			ret = -EINTR;
>> 			goto out;
>> 		}
>> 		progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
>> 						false, get_swappiness(mem));
>> 		if (!progress) {
>> 			nr_retries--;
>> 			/* maybe some writeback is necessary */
>> 			congestion_wait(BLK_RW_ASYNC, HZ/10);
>> 		}
>> 
>> 	}
>> 
>> With this patch applied, it is possible that when do_try_to_free_pages()
>> calls shrink_zones() for priority 0 that shrink_zones() may return 1
>> indicating progress, even though no pages may have been reclaimed.
>> Because this is a cgroup operation, scanning_global_lru() is false and
>> the following portion of do_try_to_free_pages() fails to set ret=0.
>> > 	if (ret && scanning_global_lru(sc))
>> >  		ret = sc->nr_reclaimed;
>> This leaves ret=1 indicating that do_try_to_free_pages() reclaimed 1
>> page even though it did not reclaim any pages.  Therefore
>> mem_cgroup_force_empty() erroneously believes that
>> try_to_free_mem_cgroup_pages() is making progress (one page at a time),
>> so there is an endless loop.
>
> Good catch!
>
> Yeah, your analysis is fine. thank you for both your testing and
> making analysis.
>
> Unfortunatelly, this logic need more fix. because It have already been
> corrupted by another regression. my point is, if priority==0 reclaim 
> failure occur, "ret = sc->nr_reclaimed" makes no sense at all.
>
> The fixing patch is here. What do you think?

Your fixing patch looks good and passes the test which detected the
issue.  Thank you.

Tested-by: Greg Thelen <gthelen@xxxxxxxxxx>

> From 49a395b21fe1b2f864112e71d027ffcafbdc9fc0 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> Date: Tue, 1 Jun 2010 11:29:50 +0900
> Subject: [PATCH] vmscan: Fix do_try_to_free_pages() return value when priority==0 reclaim failure
>
> Greg Thelen reported recent Johannes's stack diet patch makes kernel
> hang. His test is following.
>
>   mount -t cgroup none /cgroups -o memory
>   mkdir /cgroups/cg1
>   echo $$ > /cgroups/cg1/tasks
>   dd bs=1024 count=1024 if=/dev/null of=/data/foo
>   echo $$ > /cgroups/tasks
>   echo 1 > /cgroups/cg1/memory.force_empty
>
> Actually, This OOM hard to try logic have been corrupted
> since following two years old patch.
>
> 	commit a41f24ea9fd6169b147c53c2392e2887cc1d9247
> 	Author: Nishanth Aravamudan <nacc@xxxxxxxxxx>
> 	Date:   Tue Apr 29 00:58:25 2008 -0700
>
> 	    page allocator: smarter retry of costly-order allocations
>
> Original intention was "return success if the system have shrinkable
> zones though priority==0 reclaim was failure". But the above patch
> changed to "return nr_reclaimed if .....". Oh, That forgot nr_reclaimed
> may be 0 if priority==0 reclaim failure.
>
> And Johannes's patch made more corrupt. Originally, priority==0 recliam
> failure on memcg return 0, but this patch changed to return 1. It
> totally confused memcg.
>
> This patch fixes it completely.
>
> Reported-by: Greg Thelen <gthelen@xxxxxxxxxx>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
>  mm/vmscan.c |   29 ++++++++++++++++-------------
>  1 files changed, 16 insertions(+), 13 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 915dceb..a204209 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1724,13 +1724,13 @@ static void shrink_zone(int priority, struct zone *zone,
>   * If a zone is deemed to be full of pinned pages then just give it a light
>   * scan then give up on it.
>   */
> -static int shrink_zones(int priority, struct zonelist *zonelist,
> +static bool shrink_zones(int priority, struct zonelist *zonelist,
>  					struct scan_control *sc)
>  {
>  	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
>  	struct zoneref *z;
>  	struct zone *zone;
> -	int progress = 0;
> +	bool all_unreclaimable = true;
>  
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
>  					sc->nodemask) {
> @@ -1757,9 +1757,9 @@ static int shrink_zones(int priority, struct zonelist *zonelist,
>  		}
>  
>  		shrink_zone(priority, zone, sc);
> -		progress = 1;
> +		all_unreclaimable = false;
>  	}
> -	return progress;
> +	return all_unreclaimable;
>  }
>  
>  /*
> @@ -1782,7 +1782,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  					struct scan_control *sc)
>  {
>  	int priority;
> -	unsigned long ret = 0;
> +	bool all_unreclaimable; 
>  	unsigned long total_scanned = 0;
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long lru_pages = 0;
> @@ -1813,7 +1813,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		sc->nr_scanned = 0;
>  		if (!priority)
>  			disable_swap_token();
> -		ret = shrink_zones(priority, zonelist, sc);
> +		all_unreclaimable = shrink_zones(priority, zonelist, sc);
>  		/*
>  		 * Don't shrink slabs when reclaiming memory from
>  		 * over limit cgroups
> @@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  			}
>  		}
>  		total_scanned += sc->nr_scanned;
> -		if (sc->nr_reclaimed >= sc->nr_to_reclaim) {
> -			ret = sc->nr_reclaimed;
> +		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>  			goto out;
> -		}
>  
>  		/*
>  		 * Try to write back as many pages as we just scanned.  This
> @@ -1849,9 +1847,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		    priority < DEF_PRIORITY - 2)
>  			congestion_wait(BLK_RW_ASYNC, HZ/10);
>  	}
> -	/* top priority shrink_zones still had more to do? don't OOM, then */
> -	if (ret && scanning_global_lru(sc))
> -		ret = sc->nr_reclaimed;
> +
>  out:
>  	/*
>  	 * Now that we've scanned all the zones at this priority level, note
> @@ -1877,7 +1873,14 @@ out:
>  	delayacct_freepages_end();
>  	put_mems_allowed();
>  
> -	return ret;
> +	if (sc->nr_reclaimed)
> +		return sc->nr_reclaimed;
> +
> +	/* top priority shrink_zones still had more to do? don't OOM, then */
> +	if (scanning_global_lru(sc) && !all_unreclaimable)
> +		return 1;
> +
> +	return 0;
>  }
>  
>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> -- 
> 1.6.5.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>