On Fri, Jan 10, 2014 at 01:10:41PM -0500, Johannes Weiner wrote:
> The VM maintains cached filesystem pages on two types of lists. One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past. We call the recently used
> list "inactive list" and the frequently used list "active list".
>
> Currently, the VM aims for a 1:1 ratio between the lists, which is the
> "perfect" trade-off between the ability to *protect* frequently used
> pages and the ability to *detect* frequently used pages. This means
> that working set changes bigger than half of cache memory go
> undetected and thrash indefinitely, whereas working sets bigger than
> half of cache memory are unprotected against used-once streams that
> don't even need caching.
>
> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list. This model gave established
> working sets more gracetime in the face of temporary use-once streams,
> but ultimately was not significantly better than a FIFO policy and
> still thrashed cache based on eviction speed, rather than actual
> demand for cache.
>
> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size. By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
>
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot. On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
>
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list. And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>

Reviewed-by: Minchan Kim <minchan@xxxxxxxxxx>

Really nice description, and the code is easy to understand. I believe
we should merge and maintain such an advanced algorithm; it will make
it much easier to build more advanced concepts on top of it later.

Johannes, thanks for your effort!
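For anyone trying to follow the arithmetic in the changelog, here is a
minimal userspace sketch of the idea: snapshot the per-zone
eviction/activation counter when a page is evicted, and on refault
compare the counter delta against the size of the active list. All
names below (toy_zone, toy_eviction, toy_refault) are made up for
illustration only; the kernel side uses an atomic counter per zone and
stores the snapshot in the page cache radix tree instead.

/*
 * Toy userspace model of the refault-distance check, NOT the kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_zone {
        unsigned long inactive_age;     /* evictions + activations so far */
        unsigned long nr_active_file;   /* pages on the active file list */
};

/* At eviction: the returned snapshot plays the role of the shadow entry. */
static unsigned long toy_eviction(struct toy_zone *zone)
{
        return ++zone->inactive_age;
}

/* At refault: true means "activate the refaulting page". */
static bool toy_refault(const struct toy_zone *zone, unsigned long eviction)
{
        /* Unsigned subtraction, so a wrapped counter still yields a distance. */
        unsigned long refault_distance = zone->inactive_age - eviction;

        return refault_distance <= zone->nr_active_file;
}

int main(void)
{
        struct toy_zone zone = { .inactive_age = 1000, .nr_active_file = 64 };
        unsigned long shadow = toy_eviction(&zone);

        zone.inactive_age += 40;   /* 40 evictions/activations while out of cache */
        printf("activate? %s\n", toy_refault(&zone, shadow) ? "yes" : "no");
        return 0;
}

With these made-up numbers the refault distance is 40, which fits into
the 64 active slots, so the page gets activated; drop nr_active_file
below 40 and it stays on the inactive list.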
> ---
>  include/linux/mmzone.h |   5 +
>  include/linux/swap.h   |   5 +
>  mm/Makefile            |   2 +-
>  mm/filemap.c           |  61 ++++++++----
>  mm/swap.c              |   2 +
>  mm/vmscan.c            |  24 ++++-
>  mm/vmstat.c            |   2 +
>  mm/workingset.c        | 253 +++++++++++++++++++++++++++++++++++++++++++++++++
>  8 files changed, 331 insertions(+), 23 deletions(-)
>  create mode 100644 mm/workingset.c
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bd791e452ad7..118ba9f51e86 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -142,6 +142,8 @@ enum zone_stat_item {
>          NUMA_LOCAL,             /* allocation from local node */
>          NUMA_OTHER,             /* allocation from other node */
>  #endif
> +        WORKINGSET_REFAULT,
> +        WORKINGSET_ACTIVATE,
>          NR_ANON_TRANSPARENT_HUGEPAGES,
>          NR_FREE_CMA_PAGES,
>          NR_VM_ZONE_STAT_ITEMS };
> @@ -392,6 +394,9 @@ struct zone {
>          spinlock_t              lru_lock;
>          struct lruvec           lruvec;
>
> +        /* Evictions & activations on the inactive file list */
> +        atomic_long_t           inactive_age;
> +
>          unsigned long           pages_scanned;  /* since last reclaim */
>          unsigned long           flags;          /* zone flags, see below */
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6c219f..b83cf61403ed 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -260,6 +260,11 @@ struct swap_list_t {
>          int next;       /* swapfile to be used next */
>  };
>
> +/* linux/mm/workingset.c */
> +void *workingset_eviction(struct address_space *mapping, struct page *page);
> +bool workingset_refault(void *shadow);
> +void workingset_activation(struct page *page);
> +
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalram_pages;
>  extern unsigned long totalreserve_pages;
> diff --git a/mm/Makefile b/mm/Makefile
> index 305d10acd081..b30aeb86abd6 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -17,7 +17,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
>             util.o mmzone.o vmstat.o backing-dev.o \
>             mm_init.o mmu_context.o percpu.o slab_common.o \
>             compaction.o balloon_compaction.o \
> -           interval_tree.o list_lru.o $(mmu-y)
> +           interval_tree.o list_lru.o workingset.o $(mmu-y)
>
>  obj-y += init-mm.o
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d02db5801dda..65a374c0df4f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -469,7 +469,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>
>  static int page_cache_tree_insert(struct address_space *mapping,
> -                                  struct page *page)
> +                                  struct page *page, void **shadowp)
>  {
>          void **slot;
>          int error;
> @@ -484,6 +484,8 @@ static int page_cache_tree_insert(struct address_space *mapping,
>                  radix_tree_replace_slot(slot, page);
>                  mapping->nrshadows--;
>                  mapping->nrpages++;
> +                if (shadowp)
> +                        *shadowp = p;
>                  return 0;
>          }
>          error = radix_tree_insert(&mapping->page_tree, page->index, page);
> @@ -492,18 +494,10 @@ static int page_cache_tree_insert(struct address_space *mapping,
>          return error;
>  }
>
> -/**
> - * add_to_page_cache_locked - add a locked page to the pagecache
> - * @page: page to add
> - * @mapping: the page's address_space
> - * @offset: page index
> - * @gfp_mask: page allocation mode
> - *
> - * This function is used to add a page to the pagecache. It must be locked.
> - * This function does not add the page to the LRU. The caller must do that.
> - */
> -int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> -                             pgoff_t offset, gfp_t gfp_mask)
> +static int __add_to_page_cache_locked(struct page *page,
> +                                      struct address_space *mapping,
> +                                      pgoff_t offset, gfp_t gfp_mask,
> +                                      void **shadowp)
>  {
>          int error;
>
> @@ -526,7 +520,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>          page->index = offset;
>
>          spin_lock_irq(&mapping->tree_lock);
> -        error = page_cache_tree_insert(mapping, page);
> +        error = page_cache_tree_insert(mapping, page, shadowp);
>          radix_tree_preload_end();
>          if (unlikely(error))
>                  goto err_insert;
> @@ -542,16 +536,49 @@ err_insert:
>          page_cache_release(page);
>          return error;
>  }
> +
> +/**
> + * add_to_page_cache_locked - add a locked page to the pagecache
> + * @page: page to add
> + * @mapping: the page's address_space
> + * @offset: page index
> + * @gfp_mask: page allocation mode
> + *
> + * This function is used to add a page to the pagecache. It must be locked.
> + * This function does not add the page to the LRU. The caller must do that.
> + */
> +int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> +                             pgoff_t offset, gfp_t gfp_mask)
> +{
> +        return __add_to_page_cache_locked(page, mapping, offset,
> +                                          gfp_mask, NULL);
> +}
>  EXPORT_SYMBOL(add_to_page_cache_locked);
>
>  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>                            pgoff_t offset, gfp_t gfp_mask)
>  {
> +        void *shadow = NULL;
>          int ret;
>
> -        ret = add_to_page_cache(page, mapping, offset, gfp_mask);
> -        if (ret == 0)
> -                lru_cache_add_file(page);
> +        __set_page_locked(page);
> +        ret = __add_to_page_cache_locked(page, mapping, offset,
> +                                         gfp_mask, &shadow);
> +        if (unlikely(ret))
> +                __clear_page_locked(page);
> +        else {
> +                /*
> +                 * The page might have been evicted from cache only
> +                 * recently, in which case it should be activated like
> +                 * any other repeatedly accessed page.
> +                 */
> +                if (shadow && workingset_refault(shadow)) {
> +                        SetPageActive(page);
> +                        workingset_activation(page);
> +                } else
> +                        ClearPageActive(page);
> +                lru_cache_add(page);
> +        }
>          return ret;
>  }
>  EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
> diff --git a/mm/swap.c b/mm/swap.c
> index f624e5b4b724..ece5c49d6364 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -519,6 +519,8 @@ void mark_page_accessed(struct page *page)
>                  else
>                          __lru_cache_activate_page(page);
>                  ClearPageReferenced(page);
> +                if (page_is_file_cache(page))
> +                        workingset_activation(page);
>          } else if (!PageReferenced(page)) {
>                  SetPageReferenced(page);
>          }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b954b31602cf..0d3c3d7f8c1b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -505,7 +505,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
>   * Same as remove_mapping, but if the page is removed from the mapping, it
>   * gets returned with a refcount of 0.
>   */
> -static int __remove_mapping(struct address_space *mapping, struct page *page)
> +static int __remove_mapping(struct address_space *mapping, struct page *page,
> +                            bool reclaimed)
>  {
>          BUG_ON(!PageLocked(page));
>          BUG_ON(mapping != page_mapping(page));
> @@ -551,10 +552,23 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
>                  swapcache_free(swap, page);
>          } else {
>                  void (*freepage)(struct page *);
> +                void *shadow = NULL;
>
>                  freepage = mapping->a_ops->freepage;
> -
> -                __delete_from_page_cache(page, NULL);
> +                /*
> +                 * Remember a shadow entry for reclaimed file cache in
> +                 * order to detect refaults, thus thrashing, later on.
> +                 *
> +                 * But don't store shadows in an address space that is
> +                 * already exiting. This is not just an optimization,
> +                 * inode reclaim needs to empty out the radix tree or
> +                 * the nodes are lost. Don't plant shadows behind its
> +                 * back.
> +                 */
> +                if (reclaimed && page_is_file_cache(page) &&
> +                    !mapping_exiting(mapping))
> +                        shadow = workingset_eviction(mapping, page);
> +                __delete_from_page_cache(page, shadow);
>                  spin_unlock_irq(&mapping->tree_lock);
>                  mem_cgroup_uncharge_cache_page(page);
>
> @@ -577,7 +591,7 @@ cannot_free:
>   */
>  int remove_mapping(struct address_space *mapping, struct page *page)
>  {
> -        if (__remove_mapping(mapping, page)) {
> +        if (__remove_mapping(mapping, page, false)) {
>                  /*
>                   * Unfreezing the refcount with 1 rather than 2 effectively
>                   * drops the pagecache ref for us without requiring another
> @@ -1047,7 +1061,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                          }
>                  }
>
> -                if (!mapping || !__remove_mapping(mapping, page))
> +                if (!mapping || !__remove_mapping(mapping, page, true))
>                          goto keep_locked;
>
>                  /*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 9bb314577911..3ac830d1b533 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -770,6 +770,8 @@ const char * const vmstat_text[] = {
>          "numa_local",
>          "numa_other",
>  #endif
> +        "workingset_refault",
> +        "workingset_activate",
>          "nr_anon_transparent_hugepages",
>          "nr_free_cma",
>          "nr_dirty_threshold",
> diff --git a/mm/workingset.c b/mm/workingset.c
> new file mode 100644
> index 000000000000..8a6c7cff4923
> --- /dev/null
> +++ b/mm/workingset.c
> @@ -0,0 +1,253 @@
> +/*
> + * Workingset detection
> + *
> + * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
> + */
> +
> +#include <linux/memcontrol.h>
> +#include <linux/writeback.h>
> +#include <linux/pagemap.h>
> +#include <linux/atomic.h>
> +#include <linux/module.h>
> +#include <linux/swap.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +
> +/*
> + * Double CLOCK lists
> + *
> + * Per zone, two clock lists are maintained for file pages: the
> + * inactive and the active list. Freshly faulted pages start out at
> + * the head of the inactive list and page reclaim scans pages from the
> + * tail. Pages that are accessed multiple times on the inactive list
> + * are promoted to the active list, to protect them from reclaim,
> + * whereas active pages are demoted to the inactive list when the
> + * active list grows too big.
> + *
> + *   fault ------------------------+
> + *                                 |
> + *              +--------------+   |            +-------------+
> + *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
> + *              +--------------+                +-------------+    |
> + *                     |                                           |
> + *                     +-------------- promotion ------------------+
> + *
> + *
> + * Access frequency and refault distance
> + *
> + * A workload is thrashing when its pages are frequently used but they
> + * are evicted from the inactive list every time before another access
> + * would have promoted them to the active list.
> + *
> + * In cases where the average access distance between thrashing pages
> + * is bigger than the size of memory there is nothing that can be
> + * done - the thrashing set could never fit into memory under any
> + * circumstance.
> + *
> + * However, the average access distance could be bigger than the
> + * inactive list, yet smaller than the size of memory. In this case,
> + * the set could fit into memory if it weren't for the currently
> + * active pages - which may be used more, hopefully less frequently:
> + *
> + *      +-memory available to cache-+
> + *      |                           |
> + *      +-inactive------+-active----+
> + *  a b | c d e f g h i | J K L M N |
> + *      +---------------+-----------+
> + *
> + * It is prohibitively expensive to accurately track access frequency
> + * of pages. But a reasonable approximation can be made to measure
> + * thrashing on the inactive list, after which refaulting pages can be
> + * activated optimistically to compete with the existing active pages.
> + *
> + * Approximating inactive page access frequency - Observations:
> + *
> + * 1. When a page is accessed for the first time, it is added to the
> + *    head of the inactive list, slides every existing inactive page
> + *    towards the tail by one slot, and pushes the current tail page
> + *    out of memory.
> + *
> + * 2. When a page is accessed for the second time, it is promoted to
> + *    the active list, shrinking the inactive list by one slot. This
> + *    also slides all inactive pages that were faulted into the cache
> + *    more recently than the activated page towards the tail of the
> + *    inactive list.
> + *
> + * Thus:
> + *
> + * 1. The sum of evictions and activations between any two points in
> + *    time indicate the minimum number of inactive pages accessed in
> + *    between.
> + *
> + * 2. Moving one inactive page N page slots towards the tail of the
> + *    list requires at least N inactive page accesses.
> + *
> + * Combining these:
> + *
> + * 1. When a page is finally evicted from memory, the number of
> + *    inactive pages accessed while the page was in cache is at least
> + *    the number of page slots on the inactive list.
> + *
> + * 2. In addition, measuring the sum of evictions and activations (E)
> + *    at the time of a page's eviction, and comparing it to another
> + *    reading (R) at the time the page faults back into memory tells
> + *    the minimum number of accesses while the page was not cached.
> + *    This is called the refault distance.
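
Just to make the two counter readings concrete for myself, a made-up
numeric example (not from any measurement):

        counter at eviction  E = 1000
        counter at refault   R = 1040
        refault distance R - E =   40

i.e. at least 40 inactive pages were accessed while this page was out
of cache.
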
> + *
> + * Because the first access of the page was the fault and the second
> + * access the refault, we combine the in-cache distance with the
> + * out-of-cache distance to get the complete minimum access distance
> + * of this page:
> + *
> + *        NR_inactive + (R - E)
> + *
> + * And knowing the minimum access distance of a page, we can easily
> + * tell if the page would be able to stay in cache assuming all page
> + * slots in the cache were available:
> + *
> + *        NR_inactive + (R - E) <= NR_inactive + NR_active
> + *
> + * which can be further simplified to
> + *
> + *        (R - E) <= NR_active
> + *
> + * Put into words, the refault distance (out-of-cache) can be seen as
> + * a deficit in inactive list space (in-cache). If the inactive list
> + * had (R - E) more page slots, the page would not have been evicted
> + * in between accesses, but activated instead. And on a full system,
> + * the only thing eating into inactive list space is active pages.
> + *
> + *
> + * Activating refaulting pages
> + *
> + * All that is known about the active list is that the pages have been
> + * accessed more than once in the past. This means that at any given
> + * time there is actually a good chance that pages on the active list
> + * are no longer in active use.
> + *
> + * So when a refault distance of (R - E) is observed and there are at
> + * least (R - E) active pages, the refaulting page is activated
> + * optimistically in the hope that (R - E) active pages are actually
> + * used less frequently than the refaulting page - or even not used at
> + * all anymore.
> + *
> + * If this is wrong and demotion kicks in, the pages which are truly
> + * used more frequently will be reactivated while the less frequently
> + * used ones will be evicted from memory.
> + *
> + * But if this is right, the stale pages will be pushed out of memory
> + * and the used pages get to stay in cache.
> + *
> + *
> + * Implementation
> + *
> + * For each zone's file LRU lists, a counter for inactive evictions
> + * and activations is maintained (zone->inactive_age).
> + *
> + * On eviction, a snapshot of this counter (along with some bits to
> + * identify the zone) is stored in the now empty page cache radix tree
> + * slot of the evicted page. This is called a shadow entry.
> + *
> + * On cache misses for which there are shadow entries, an eligible
> + * refault distance will immediately activate the refaulting page.
> + */
> +
> +static void *pack_shadow(unsigned long eviction, struct zone *zone)
> +{
> +        eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
> +        eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
> +        eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
> +
> +        return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
> +}
> +
> +static void unpack_shadow(void *shadow,
> +                          struct zone **zone,
> +                          unsigned long *distance)
> +{
> +        unsigned long entry = (unsigned long)shadow;
> +        unsigned long eviction;
> +        unsigned long refault;
> +        unsigned long mask;
> +        int zid, nid;
> +
> +        entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> +        zid = entry & ((1UL << ZONES_SHIFT) - 1);
> +        entry >>= ZONES_SHIFT;
> +        nid = entry & ((1UL << NODES_SHIFT) - 1);
> +        entry >>= NODES_SHIFT;
> +        eviction = entry;
> +
> +        *zone = NODE_DATA(nid)->node_zones + zid;
> +
> +        refault = atomic_long_read(&(*zone)->inactive_age);
> +        mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
> +                        RADIX_TREE_EXCEPTIONAL_SHIFT);
> +        /*
> +         * The unsigned subtraction here gives an accurate distance
> +         * across inactive_age overflows in most cases.
> +         *
> +         * There is a special case: usually, shadow entries have a
> +         * short lifetime and are either refaulted or reclaimed along
> +         * with the inode before they get too old. But it is not
> +         * impossible for the inactive_age to lap a shadow entry in
> +         * the field, which can then result in a false small
> +         * refault distance, leading to a false activation should this
> +         * old entry actually refault again. However, earlier kernels
> +         * used to deactivate unconditionally with *every* reclaim
> +         * invocation for the longest time, so the occasional
> +         * inappropriate activation leading to pressure on the active
> +         * list is not a problem.
> +         */
> +        *distance = (refault - eviction) & mask;
> +}
> +
> +/**
> + * workingset_eviction - note the eviction of a page from memory
> + * @mapping: address space the page was backing
> + * @page: the page being evicted
> + *
> + * Returns a shadow entry to be stored in @mapping->page_tree in place
> + * of the evicted @page so that a later refault can be detected.
> + */
> +void *workingset_eviction(struct address_space *mapping, struct page *page)
> +{
> +        struct zone *zone = page_zone(page);
> +        unsigned long eviction;
> +
> +        eviction = atomic_long_inc_return(&zone->inactive_age);
> +        return pack_shadow(eviction, zone);
> +}
> +
> +/**
> + * workingset_refault - evaluate the refault of a previously evicted page
> + * @shadow: shadow entry of the evicted page
> + *
> + * Calculates and evaluates the refault distance of the previously
> + * evicted page in the context of the zone it was allocated in.
> + *
> + * Returns %true if the page should be activated, %false otherwise.
> + */
> +bool workingset_refault(void *shadow)
> +{
> +        unsigned long refault_distance;
> +        struct zone *zone;
> +
> +        unpack_shadow(shadow, &zone, &refault_distance);
> +        inc_zone_state(zone, WORKINGSET_REFAULT);
> +
> +        if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> +                inc_zone_state(zone, WORKINGSET_ACTIVATE);
> +                return true;
> +        }
> +        return false;
> +}
> +
> +/**
> + * workingset_activation - note a page activation
> + * @page: page that is being activated
> + */
> +void workingset_activation(struct page *page)
> +{
> +        atomic_long_inc(&page_zone(page)->inactive_age);
> +}
> --
> 1.8.4.2

-- 
Kind regards,
Minchan Kim