Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated

Vlastimil Babka <vbabka@xxxxxxx> · Thu, 12 Feb 2015 17:19:10 +0100

On 02/11/2015 11:14 PM, Andrew Morton wrote:
> 
> Did we end up deciding to merge this, or is
> http://ozlabs.org/~akpm/mmots/broken-out/vmstat-do-not-use-deferrable-delayed-work-for-vmstat_update.patch
> a sufficient fix?

I think Michal wanted the general vmstat worker fix from elsewhere in the thread
tested first, to see whether it solves the problem by itself, without this patch.

> From: Vinayak Menon <vinmenon@xxxxxxxxxxxxxx>
> Subject: mm: vmscan: fix the page state calculation in too_many_isolated
> 
> It is observed that sometimes multiple tasks get blocked for a long time in the
> congestion_wait loop below, in shrink_inactive_list.  This is because of
> vm_stat values not being synced.
> 
> (__schedule) from [<c0a03328>]
> (schedule_timeout) from [<c0a04940>]
> (io_schedule_timeout) from [<c01d585c>]
> (congestion_wait) from [<c01cc9d8>]
> (shrink_inactive_list) from [<c01cd034>]
> (shrink_zone) from [<c01cdd08>]
> (try_to_free_pages) from [<c01c442c>]
> (__alloc_pages_nodemask) from [<c01f1884>]
> (new_slab) from [<c09fcf60>]
> (__slab_alloc) from [<c01f1a6c>]
> 
> In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) had returned
> 14, zone_page_state(zone, NR_INACTIVE_FILE) returned 92, and GFP_IOFS was
> set, and this resulted in too_many_isolated returning true.  But one of
> the CPU's pageset vm_stat_diff had NR_ISOLATED_FILE as "-14".  So the
> actual isolated count was zero.  As there weren't any more updates to
> NR_ISOLATED_FILE and vmstat_update deferred work had not been scheduled
> yet, 7 tasks were spinning in the congestion wait loop for around 4
> seconds, in the direct reclaim path.
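
To make the arithmetic above concrete, here is a small user-space sketch of
the final comparison, replaying the numbers from the changelog.  It is only a
model: model_too_many_isolated() and model_changelog_case() are names made up
for illustration, and it assumes the NR_INACTIVE_FILE reading of 92 had no
pending per-CPU deltas of its own, which the changelog does not say.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model of the comparison at the end of too_many_isolated().
 * The function name and the bool parameter are made up for this sketch.
 */
static bool model_too_many_isolated(unsigned long inactive,
				    unsigned long isolated,
				    bool gfp_iofs)
{
	/* Callers with both __GFP_IO and __GFP_FS get the stricter limit. */
	if (gfp_iofs)
		inactive >>= 3;

	return isolated > inactive;
}

/* Replays the numbers from the changelog. */
static void model_changelog_case(void)
{
	unsigned long inactive = 92;	/* NR_INACTIVE_FILE as read */
	long global_isolated = 14;	/* NR_ISOLATED_FILE, global counter only */
	long pending_diff = -14;	/* the one CPU's vm_stat_diff entry */

	/* Stale view: 14 > (92 >> 3) == 11, so the caller has to wait. */
	assert(model_too_many_isolated(inactive, global_isolated, true));

	/* With the pending delta folded in: 0 > 11 is false. */
	assert(!model_too_many_isolated(inactive,
					global_isolated + pending_diff, true));
}
```

So with the shift applied the limit is 11 pages, and the stale value of 14 is
just enough to keep every direct reclaimer waiting until the delta is folded.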
> 
> This patch uses zone_page_state_snapshot instead, but restricts its usage
> to avoid performance penalty.
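
For anyone following along, the difference between the two readers can be
sketched in user space roughly as below.  This is a simplified model, not the
kernel implementation: struct model_counter, MODEL_NR_CPUS and the model_*()
helpers are hypothetical, and the real code walks the online CPUs' pagesets
rather than a fixed array.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4	/* arbitrary CPU count for the sketch */

/* One counter of one zone: a global value plus unflushed per-CPU deltas. */
struct model_counter {
	long global;
	signed char diff[MODEL_NR_CPUS];
};

/* Cheap read, in the spirit of zone_page_state(): global value only. */
static unsigned long model_page_state(const struct model_counter *c)
{
	long x = c->global;

	return x < 0 ? 0 : (unsigned long)x;
}

/* Accurate read, in the spirit of zone_page_state_snapshot(). */
static unsigned long model_page_state_snapshot(const struct model_counter *c)
{
	long x = c->global;
	size_t cpu;

	/* Fold in every CPU's pending delta; this loop is the extra cost. */
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		x += c->diff[cpu];

	return x < 0 ? 0 : (unsigned long)x;
}

/* The NR_ISOLATED_FILE situation from the changelog, on a made-up CPU 1. */
static void model_counter_case(void)
{
	struct model_counter isolated = { .global = 14, .diff = { 0, -14, 0, 0 } };

	assert(model_page_state(&isolated) == 14);
	assert(model_page_state_snapshot(&isolated) == 0);
}
```

The snapshot read costs one extra access per CPU, which is presumably the
performance penalty the changelog wants to keep off the common path.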
> 
> 
> The vmstat sync interval is HZ (sysctl_stat_interval), but since the
> vmstat_work is declared as a deferrable work, the timer trigger can be
> deferred to the next non-deferrable timer expiry on the CPU which is in
> idle.  This results in the vmstat syncing on an idle CPU being delayed by
> seconds.  Maybe in most cases this behavior is fine, except in cases like
> this.
> 
> [akpm@xxxxxxxxxxxxxxxxxxxx: move zone_page_state_snapshot() fallback logic into too_many_isolated()]
> Signed-off-by: Vinayak Menon <vinmenon@xxxxxxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Minchan Kim <minchan@xxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
> 
>  mm/vmscan.c |   51 +++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 34 insertions(+), 17 deletions(-)
> 
> diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated
> +++ a/mm/vmscan.c
> @@ -1363,6 +1363,32 @@ int isolate_lru_page(struct page *page)
>  	return ret;
>  }
>  
> +static int __too_many_isolated(struct zone *zone, int file,
> +			       struct scan_control *sc, int safe)
> +{
> +	unsigned long inactive, isolated;
> +
> +	if (safe) {
> +		inactive = zone_page_state_snapshot(zone,
> +				NR_INACTIVE_ANON + 2 * file);
> +		isolated = zone_page_state_snapshot(zone,
> +				NR_ISOLATED_ANON + file);
> +	} else {
> +		inactive = zone_page_state(zone, NR_INACTIVE_ANON + 2 * file);
> +		isolated = zone_page_state(zone, NR_ISOLATED_ANON + file);
> +	}
> +
> +	/*
> +	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
> +	 * won't get blocked by normal direct-reclaimers, forming a circular
> +	 * deadlock.
> +	 */
> +	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> +		inactive >>= 3;
> +
> +	return isolated > inactive;
> +}
> +
>  /*
>   * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
>   * then get resheduled. When there are massive number of tasks doing page
> @@ -1371,33 +1397,24 @@ int isolate_lru_page(struct page *page)
>   * unnecessary swapping, thrashing and OOM.
>   */
>  static int too_many_isolated(struct zone *zone, int file,
> -		struct scan_control *sc)
> +			     struct scan_control *sc)
>  {
> -	unsigned long inactive, isolated;
> -
>  	if (current_is_kswapd())
>  		return 0;
>  
>  	if (!global_reclaim(sc))
>  		return 0;
>  
> -	if (file) {
> -		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> -		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> -	} else {
> -		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> -		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> -	}
> -
>  	/*
> -	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
> -	 * won't get blocked by normal direct-reclaimers, forming a circular
> -	 * deadlock.
> +	 * __too_many_isolated(safe=0) is fast but inaccurate, because it
> +	 * doesn't account for the vm_stat_diff[] counters.  So if it looks
> +	 * like too_many_isolated() is about to return true, fall back to the
> +	 * slower, more accurate zone_page_state_snapshot().
>  	 */
> -	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> -		inactive >>= 3;
> +	if (unlikely(__too_many_isolated(zone, file, sc, 0)))
> +		return __too_many_isolated(zone, file, sc, 1);
>  
> -	return isolated > inactive;
> +	return 0;
>  }
>  
>  static noinline_for_stack void
> _
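
The control flow of the patched too_many_isolated() can be tried out in user
space with something like the sketch below.  Everything in it is hypothetical
scaffolding (struct model_zone, model_check(), model_too_many(), the
exact_reads counter); it hard-codes the GFP_IOFS case and assumes the inactive
count is accurate, so it only illustrates when the slow read is taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two ways of reading the counters. */
struct model_zone {
	unsigned long inactive;		/* assumed accurate in this sketch */
	unsigned long isolated_fast;	/* global counter, possibly stale */
	unsigned long isolated_exact;	/* global counter plus per-CPU deltas */
	unsigned int exact_reads;	/* how often the slow path was taken */
};

static bool model_check(struct model_zone *z, bool safe)
{
	unsigned long isolated;

	if (safe) {
		z->exact_reads++;
		isolated = z->isolated_exact;
	} else {
		isolated = z->isolated_fast;
	}

	/* The GFP_IOFS case from the patch: compare against inactive / 8. */
	return isolated > (z->inactive >> 3);
}

/* Same shape as the patched too_many_isolated(): cheap check first. */
static bool model_too_many(struct model_zone *z)
{
	if (model_check(z, false))
		return model_check(z, true);

	return false;
}

/* The stale case from the changelog: one slow read, then no waiting. */
static void model_fallback_case(void)
{
	struct model_zone stale = {
		.inactive = 92, .isolated_fast = 14, .isolated_exact = 0,
	};

	assert(!model_too_many(&stale));
	assert(stale.exact_reads == 1);
}
```

Note that the slow read only happens when the cheap check says "too many".  A
stale value that errs in the other direction is never rechecked, which seems
acceptable here since that case does not make anybody wait.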
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
> 
