Re: [patch 7/9] mm: thrash detection-based file cache sizing

Bob Liu <bob.liu@xxxxxxxxxx> · Tue, 14 Jan 2014 09:01:09 +0800

Hi Johannes,

On 01/11/2014 02:10 AM, Johannes Weiner wrote:
> The VM maintains cached filesystem pages on two types of lists.  One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past.  We call the recently used
> list "inactive list" and the frequently used list "active list".
> 
> Currently, the VM aims for a 1:1 ratio between the lists, which is the
> "perfect" trade-off between the ability to *protect* frequently used
> pages and the ability to *detect* frequently used pages.  This means
> that working set changes bigger than half of cache memory go
> undetected and thrash indefinitely, whereas working sets bigger than
> half of cache memory are unprotected against used-once streams that
> don't even need caching.
> 

Good job! This patch looks good to me and with nice descriptions.
But it seems that this patch only fix the issue "working set changes
bigger than half of cache memory go undetected and thrash indefinitely".
My concern is could it be extended easily to address all other issues
based on this patch set?

The other possible way is something like Peter has implemented the CART
and Clock-Pro which I think may be better because of using advanced
algorithms and consider the problem as a whole from the beginning.(Sorry
I haven't get enough time to read the source code, so I'm not 100% sure.)
http://linux-mm.org/PeterZClockPro2

> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list.  This model gave established
> working sets more gracetime in the face of temporary use-once streams,
> but ultimately was not significantly better than a FIFO policy and
> still thrashed cache based on eviction speed, rather than actual
> demand for cache.
> 
> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size.  By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
> 
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot.  On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
> 
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list.  And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
> 
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
>  include/linux/mmzone.h |   5 +
>  include/linux/swap.h   |   5 +
>  mm/Makefile            |   2 +-
>  mm/filemap.c           |  61 ++++++++----
>  mm/swap.c              |   2 +
>  mm/vmscan.c            |  24 ++++-
>  mm/vmstat.c            |   2 +
>  mm/workingset.c        | 253 +++++++++++++++++++++++++++++++++++++++++++++++++
>  8 files changed, 331 insertions(+), 23 deletions(-)
>  create mode 100644 mm/workingset.c
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bd791e452ad7..118ba9f51e86 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -142,6 +142,8 @@ enum zone_stat_item {
>  	NUMA_LOCAL,		/* allocation from local node */
>  	NUMA_OTHER,		/* allocation from other node */
>  #endif
> +	WORKINGSET_REFAULT,
> +	WORKINGSET_ACTIVATE,
>  	NR_ANON_TRANSPARENT_HUGEPAGES,
>  	NR_FREE_CMA_PAGES,
>  	NR_VM_ZONE_STAT_ITEMS };
> @@ -392,6 +394,9 @@ struct zone {
>  	spinlock_t		lru_lock;
>  	struct lruvec		lruvec;
>  
> +	/* Evictions & activations on the inactive file list */
> +	atomic_long_t		inactive_age;
> +
>  	unsigned long		pages_scanned;	   /* since last reclaim */
>  	unsigned long		flags;		   /* zone flags, see below */
>  
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6c219f..b83cf61403ed 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -260,6 +260,11 @@ struct swap_list_t {
>  	int next;	/* swapfile to be used next */
>  };
>  
> +/* linux/mm/workingset.c */
> +void *workingset_eviction(struct address_space *mapping, struct page *page);
> +bool workingset_refault(void *shadow);
> +void workingset_activation(struct page *page);
> +
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalram_pages;
>  extern unsigned long totalreserve_pages;
> diff --git a/mm/Makefile b/mm/Makefile
> index 305d10acd081..b30aeb86abd6 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -17,7 +17,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
>  			   compaction.o balloon_compaction.o \
> -			   interval_tree.o list_lru.o $(mmu-y)
> +			   interval_tree.o list_lru.o workingset.o $(mmu-y)
>  
>  obj-y += init-mm.o
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d02db5801dda..65a374c0df4f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -469,7 +469,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>  
>  static int page_cache_tree_insert(struct address_space *mapping,
> -				  struct page *page)
> +				  struct page *page, void **shadowp)
>  {
>  	void **slot;
>  	int error;
> @@ -484,6 +484,8 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  		radix_tree_replace_slot(slot, page);
>  		mapping->nrshadows--;
>  		mapping->nrpages++;
> +		if (shadowp)
> +			*shadowp = p;
>  		return 0;
>  	}
>  	error = radix_tree_insert(&mapping->page_tree, page->index, page);
> @@ -492,18 +494,10 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  	return error;
>  }
>  
> -/**
> - * add_to_page_cache_locked - add a locked page to the pagecache
> - * @page:	page to add
> - * @mapping:	the page's address_space
> - * @offset:	page index
> - * @gfp_mask:	page allocation mode
> - *
> - * This function is used to add a page to the pagecache. It must be locked.
> - * This function does not add the page to the LRU.  The caller must do that.
> - */
> -int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> -		pgoff_t offset, gfp_t gfp_mask)
> +static int __add_to_page_cache_locked(struct page *page,
> +				      struct address_space *mapping,
> +				      pgoff_t offset, gfp_t gfp_mask,
> +				      void **shadowp)
>  {
>  	int error;
>  
> @@ -526,7 +520,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  	page->index = offset;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	error = page_cache_tree_insert(mapping, page);
> +	error = page_cache_tree_insert(mapping, page, shadowp);
>  	radix_tree_preload_end();
>  	if (unlikely(error))
>  		goto err_insert;
> @@ -542,16 +536,49 @@ err_insert:
>  	page_cache_release(page);
>  	return error;
>  }
> +
> +/**
> + * add_to_page_cache_locked - add a locked page to the pagecache
> + * @page:	page to add
> + * @mapping:	the page's address_space
> + * @offset:	page index
> + * @gfp_mask:	page allocation mode
> + *
> + * This function is used to add a page to the pagecache. It must be locked.
> + * This function does not add the page to the LRU.  The caller must do that.
> + */
> +int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> +		pgoff_t offset, gfp_t gfp_mask)
> +{
> +	return __add_to_page_cache_locked(page, mapping, offset,
> +					  gfp_mask, NULL);
> +}
>  EXPORT_SYMBOL(add_to_page_cache_locked);
>  
>  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>  				pgoff_t offset, gfp_t gfp_mask)
>  {
> +	void *shadow = NULL;
>  	int ret;
>  
> -	ret = add_to_page_cache(page, mapping, offset, gfp_mask);
> -	if (ret == 0)
> -		lru_cache_add_file(page);
> +	__set_page_locked(page);
> +	ret = __add_to_page_cache_locked(page, mapping, offset,
> +					 gfp_mask, &shadow);
> +	if (unlikely(ret))
> +		__clear_page_locked(page);
> +	else {
> +		/*
> +		 * The page might have been evicted from cache only
> +		 * recently, in which case it should be activated like
> +		 * any other repeatedly accessed page.
> +		 */
> +		if (shadow && workingset_refault(shadow)) {
> +			SetPageActive(page);
> +			workingset_activation(page);
> +		} else
> +			ClearPageActive(page);
> +		lru_cache_add(page);
> +	}
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
> diff --git a/mm/swap.c b/mm/swap.c
> index f624e5b4b724..ece5c49d6364 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -519,6 +519,8 @@ void mark_page_accessed(struct page *page)
>  		else
>  			__lru_cache_activate_page(page);
>  		ClearPageReferenced(page);
> +		if (page_is_file_cache(page))
> +			workingset_activation(page);
>  	} else if (!PageReferenced(page)) {
>  		SetPageReferenced(page);
>  	}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b954b31602cf..0d3c3d7f8c1b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -505,7 +505,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
>   * Same as remove_mapping, but if the page is removed from the mapping, it
>   * gets returned with a refcount of 0.
>   */
> -static int __remove_mapping(struct address_space *mapping, struct page *page)
> +static int __remove_mapping(struct address_space *mapping, struct page *page,
> +			    bool reclaimed)
>  {
>  	BUG_ON(!PageLocked(page));
>  	BUG_ON(mapping != page_mapping(page));
> @@ -551,10 +552,23 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
>  		swapcache_free(swap, page);
>  	} else {
>  		void (*freepage)(struct page *);
> +		void *shadow = NULL;
>  
>  		freepage = mapping->a_ops->freepage;
> -
> -		__delete_from_page_cache(page, NULL);
> +		/*
> +		 * Remember a shadow entry for reclaimed file cache in
> +		 * order to detect refaults, thus thrashing, later on.
> +		 *
> +		 * But don't store shadows in an address space that is
> +		 * already exiting.  This is not just an optizimation,
> +		 * inode reclaim needs to empty out the radix tree or
> +		 * the nodes are lost.  Don't plant shadows behind its
> +		 * back.
> +		 */
> +		if (reclaimed && page_is_file_cache(page) &&
> +		    !mapping_exiting(mapping))
> +			shadow = workingset_eviction(mapping, page);
> +		__delete_from_page_cache(page, shadow);
>  		spin_unlock_irq(&mapping->tree_lock);
>  		mem_cgroup_uncharge_cache_page(page);
>  
> @@ -577,7 +591,7 @@ cannot_free:
>   */
>  int remove_mapping(struct address_space *mapping, struct page *page)
>  {
> -	if (__remove_mapping(mapping, page)) {
> +	if (__remove_mapping(mapping, page, false)) {
>  		/*
>  		 * Unfreezing the refcount with 1 rather than 2 effectively
>  		 * drops the pagecache ref for us without requiring another
> @@ -1047,7 +1061,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -		if (!mapping || !__remove_mapping(mapping, page))
> +		if (!mapping || !__remove_mapping(mapping, page, true))
>  			goto keep_locked;
>  
>  		/*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 9bb314577911..3ac830d1b533 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -770,6 +770,8 @@ const char * const vmstat_text[] = {
>  	"numa_local",
>  	"numa_other",
>  #endif
> +	"workingset_refault",
> +	"workingset_activate",
>  	"nr_anon_transparent_hugepages",
>  	"nr_free_cma",
>  	"nr_dirty_threshold",
> diff --git a/mm/workingset.c b/mm/workingset.c
> new file mode 100644
> index 000000000000..8a6c7cff4923
> --- /dev/null
> +++ b/mm/workingset.c
> @@ -0,0 +1,253 @@
> +/*
> + * Workingset detection
> + *
> + * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
> + */
> +
> +#include <linux/memcontrol.h>
> +#include <linux/writeback.h>
> +#include <linux/pagemap.h>
> +#include <linux/atomic.h>
> +#include <linux/module.h>
> +#include <linux/swap.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +
> +/*
> + *		Double CLOCK lists
> + *
> + * Per zone, two clock lists are maintained for file pages: the
> + * inactive and the active list.  Freshly faulted pages start out at
> + * the head of the inactive list and page reclaim scans pages from the
> + * tail.  Pages that are accessed multiple times on the inactive list
> + * are promoted to the active list, to protect them from reclaim,
> + * whereas active pages are demoted to the inactive list when the
> + * active list grows too big.
> + *
> + *   fault ------------------------+
> + *                                 |
> + *              +--------------+   |            +-------------+
> + *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
> + *              +--------------+                +-------------+    |
> + *                     |                                           |
> + *                     +-------------- promotion ------------------+
> + *
> + *
> + *		Access frequency and refault distance
> + *
> + * A workload is thrashing when its pages are frequently used but they
> + * are evicted from the inactive list every time before another access
> + * would have promoted them to the active list.
> + *
> + * In cases where the average access distance between thrashing pages
> + * is bigger than the size of memory there is nothing that can be
> + * done - the thrashing set could never fit into memory under any
> + * circumstance.
> + *
> + * However, the average access distance could be bigger than the
> + * inactive list, yet smaller than the size of memory.  In this case,
> + * the set could fit into memory if it weren't for the currently
> + * active pages - which may be used more, hopefully less frequently:
> + *
> + *      +-memory available to cache-+
> + *      |                           |
> + *      +-inactive------+-active----+
> + *  a b | c d e f g h i | J K L M N |
> + *      +---------------+-----------+
> + *
> + * It is prohibitively expensive to accurately track access frequency
> + * of pages.  But a reasonable approximation can be made to measure
> + * thrashing on the inactive list, after which refaulting pages can be
> + * activated optimistically to compete with the existing active pages.
> + *
> + * Approximating inactive page access frequency - Observations:
> + *
> + * 1. When a page is accessed for the first time, it is added to the
> + *    head of the inactive list, slides every existing inactive page
> + *    towards the tail by one slot, and pushes the current tail page
> + *    out of memory.
> + *
> + * 2. When a page is accessed for the second time, it is promoted to
> + *    the active list, shrinking the inactive list by one slot.  This
> + *    also slides all inactive pages that were faulted into the cache
> + *    more recently than the activated page towards the tail of the
> + *    inactive list.
> + *

Nitpick, how about the reference bit?

-- 
Regards,
-Bob

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>