On Fri, Jan 10, 2014 at 01:10:41PM -0500, Johannes Weiner wrote:
> The VM maintains cached filesystem pages on two types of lists. One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past. We call the recently used
> list "inactive list" and the frequently used list "active list".
>
> Currently, the VM aims for a 1:1 ratio between the lists, which is the
> "perfect" trade-off between the ability to *protect* frequently used
> pages and the ability to *detect* frequently used pages. This means
> that working set changes bigger than half of cache memory go
> undetected and thrash indefinitely, whereas working sets bigger than
> half of cache memory are unprotected against used-once streams that
> don't even need caching.
>
> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list. This model gave established
> working sets more gracetime in the face of temporary use-once streams,
> but ultimately was not significantly better than a FIFO policy and
> still thrashed cache based on eviction speed, rather than actual
> demand for cache.
>
> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size. By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
>
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot. On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
>
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list. And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>

Reviewed-by: Minchan Kim <minchan@xxxxxxxxxx>

Really nice description, and the code is easy to understand. I believe
we should merge and maintain such an advanced algorithm; it will make
it much easier to build more advanced concepts on top of it later.

Johannes, thanks for your effort!
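For anyone trying to follow the arithmetic in the changelog, here is a
minimal userspace sketch of the idea: snapshot the per-zone
eviction/activation counter when a page is evicted, and on refault
compare the counter delta against the size of the active list. All
names below (toy_zone, toy_eviction, toy_refault) are made up for
illustration only; the kernel side uses an atomic counter per zone and
stores the snapshot in the page cache radix tree instead.

/*
 * Toy userspace model of the refault-distance check, NOT the kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_zone {
        unsigned long inactive_age;     /* evictions + activations so far */
        unsigned long nr_active_file;   /* pages on the active file list */
};

/* At eviction: the returned snapshot plays the role of the shadow entry. */
static unsigned long toy_eviction(struct toy_zone *zone)
{
        return ++zone->inactive_age;
}

/* At refault: true means "activate the refaulting page". */
static bool toy_refault(const struct toy_zone *zone, unsigned long eviction)
{
        /* Unsigned subtraction, so a wrapped counter still yields a distance. */
        unsigned long refault_distance = zone->inactive_age - eviction;

        return refault_distance <= zone->nr_active_file;
}

int main(void)
{
        struct toy_zone zone = { .inactive_age = 1000, .nr_active_file = 64 };
        unsigned long shadow = toy_eviction(&zone);

        zone.inactive_age += 40;   /* 40 evictions/activations while out of cache */
        printf("activate? %s\n", toy_refault(&zone, shadow) ? "yes" : "no");
        return 0;
}

With these made-up numbers the refault distance is 40, which fits into
the 64 active slots, so the page gets activated; drop nr_active_file
below 40 and it stays on the inactive list.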
> ---
>  include/linux/mmzone.h |   5 +
>  include/linux/swap.h   |   5 +
>  mm/Makefile            |   2 +-
>  mm/filemap.c           |  61 ++++++++----
>  mm/swap.c              |   2 +
>  mm/vmscan.c            |  24 ++++-
>  mm/vmstat.c            |   2 +
>  mm/workingset.c        | 253 +++++++++++++++++++++++++++++++++++++++++++++++++
>  8 files changed, 331 insertions(+), 23 deletions(-)
>  create mode 100644 mm/workingset.c
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bd791e452ad7..118ba9f51e86 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -142,6 +142,8 @@ enum zone_stat_item {
>          NUMA_LOCAL,             /* allocation from local node */
>          NUMA_OTHER,             /* allocation from other node */
>  #endif
> +        WORKINGSET_REFAULT,
> +        WORKINGSET_ACTIVATE,
>          NR_ANON_TRANSPARENT_HUGEPAGES,
>          NR_FREE_CMA_PAGES,
>          NR_VM_ZONE_STAT_ITEMS };
> @@ -392,6 +394,9 @@ struct zone {
>          spinlock_t              lru_lock;
>          struct lruvec           lruvec;
>
> +        /* Evictions & activations on the inactive file list */
> +        atomic_long_t           inactive_age;
> +
>          unsigned long           pages_scanned;  /* since last reclaim */
>          unsigned long           flags;          /* zone flags, see below */
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6c219f..b83cf61403ed 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -260,6 +260,11 @@ struct swap_list_t {
>          int next;       /* swapfile to be used next */
>  };
>
> +/* linux/mm/workingset.c */
> +void *workingset_eviction(struct address_space *mapping, struct page *page);
> +bool workingset_refault(void *shadow);
> +void workingset_activation(struct page *page);
> +
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalram_pages;
>  extern unsigned long totalreserve_pages;
> diff --git a/mm/Makefile b/mm/Makefile
> index 305d10acd081..b30aeb86abd6 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -17,7 +17,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
>             util.o mmzone.o vmstat.o backing-dev.o \
>             mm_init.o mmu_context.o percpu.o slab_common.o \
>             compaction.o balloon_compaction.o \
> -           interval_tree.o list_lru.o $(mmu-y)
> +           interval_tree.o list_lru.o workingset.o $(mmu-y)
>
>  obj-y += init-mm.o
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d02db5801dda..65a374c0df4f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -469,7 +469,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>
>  static int page_cache_tree_insert(struct address_space *mapping,
> -                                  struct page *page)
> +                                  struct page *page, void **shadowp)
>  {
>          void **slot;
>          int error;
> @@ -484,6 +484,8 @@ static int page_cache_tree_insert(struct address_space *mapping,
>                  radix_tree_replace_slot(slot, page);
>                  mapping->nrshadows--;
>                  mapping->nrpages++;
> +                if (shadowp)
> +                        *shadowp = p;
>                  return 0;
>          }
>          error = radix_tree_insert(&mapping->page_tree, page->index, page);
> @@ -492,18 +494,10 @@ static int page_cache_tree_insert(struct address_space *mapping,
>          return error;
>  }
>
> -/**
> - * add_to_page_cache_locked - add a locked page to the pagecache
> - * @page: page to add
> - * @mapping: the page's address_space
> - * @offset: page index
> - * @gfp_mask: page allocation mode
> - *
> - * This function is used to add a page to the pagecache. It must be locked.
> - * This function does not add the page to the LRU. The caller must do that.
> - */
> -int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> -                             pgoff_t offset, gfp_t gfp_mask)
> +static int __add_to_page_cache_locked(struct page *page,
> +                                      struct address_space *mapping,
> +                                      pgoff_t offset, gfp_t gfp_mask,
> +                                      void **shadowp)
>  {
>          int error;
>
> @@ -526,7 +520,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>          page->index = offset;
>
>          spin_lock_irq(&mapping->tree_lock);
> -        error = page_cache_tree_insert(mapping, page);
> +        error = page_cache_tree_insert(mapping, page, shadowp);
>          radix_tree_preload_end();
>          if (unlikely(error))
>                  goto err_insert;
> @@ -542,16 +536,49 @@ err_insert:
>          page_cache_release(page);
>          return error;
>  }
> +
> +/**
> + * add_to_page_cache_locked - add a locked page to the pagecache
> + * @page: page to add
> + * @mapping: the page's address_space
> + * @offset: page index
> + * @gfp_mask: page allocation mode
> + *
> + * This function is used to add a page to the pagecache. It must be locked.
> + * This function does not add the page to the LRU. The caller must do that.
> + */
> +int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> +                             pgoff_t offset, gfp_t gfp_mask)
> +{
> +        return __add_to_page_cache_locked(page, mapping, offset,
> +                                          gfp_mask, NULL);
> +}
>  EXPORT_SYMBOL(add_to_page_cache_locked);
>
>  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>                            pgoff_t offset, gfp_t gfp_mask)
>  {
> +        void *shadow = NULL;
>          int ret;
>
> -        ret = add_to_page_cache(page, mapping, offset, gfp_mask);
> -        if (ret == 0)
> -                lru_cache_add_file(page);
> +        __set_page_locked(page);
> +        ret = __add_to_page_cache_locked(page, mapping, offset,
> +                                         gfp_mask, &shadow);
> +        if (unlikely(ret))
> +                __clear_page_locked(page);
> +        else {
> +                /*
> +                 * The page might have been evicted from cache only
> +                 * recently, in which case it should be activated like
> +                 * any other repeatedly accessed page.
> +                 */
> +                if (shadow && workingset_refault(shadow)) {
> +                        SetPageActive(page);
> +                        workingset_activation(page);
> +                } else
> +                        ClearPageActive(page);
> +                lru_cache_add(page);
> +        }
>          return ret;
>  }
>  EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
> diff --git a/mm/swap.c b/mm/swap.c
> index f624e5b4b724..ece5c49d6364 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -519,6 +519,8 @@ void mark_page_accessed(struct page *page)
>                  else
>                          __lru_cache_activate_page(page);
>                  ClearPageReferenced(page);
> +                if (page_is_file_cache(page))
> +                        workingset_activation(page);
>          } else if (!PageReferenced(page)) {
>                  SetPageReferenced(page);
>          }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b954b31602cf..0d3c3d7f8c1b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -505,7 +505,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
>   * Same as remove_mapping, but if the page is removed from the mapping, it
>   * gets returned with a refcount of 0.
>   */
> -static int __remove_mapping(struct address_space *mapping, struct page *page)
> +static int __remove_mapping(struct address_space *mapping, struct page *page,
> +                            bool reclaimed)
>  {
>          BUG_ON(!PageLocked(page));
>          BUG_ON(mapping != page_mapping(page));
> @@ -551,10 +552,23 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
>                  swapcache_free(swap, page);
>          } else {
>                  void (*freepage)(struct page *);
> +                void *shadow = NULL;
>
>                  freepage = mapping->a_ops->freepage;
> -
> -                __delete_from_page_cache(page, NULL);
> +                /*
> +                 * Remember a shadow entry for reclaimed file cache in
> +                 * order to detect refaults, thus thrashing, later on.
> +                 *
> +                 * But don't store shadows in an address space that is
> +                 * already exiting. This is not just an optimization,
> +                 * inode reclaim needs to empty out the radix tree or
> +                 * the nodes are lost. Don't plant shadows behind its
> +                 * back.
> +                 */
> +                if (reclaimed && page_is_file_cache(page) &&
> +                    !mapping_exiting(mapping))
> +                        shadow = workingset_eviction(mapping, page);
> +                __delete_from_page_cache(page, shadow);
>                  spin_unlock_irq(&mapping->tree_lock);
>                  mem_cgroup_uncharge_cache_page(page);
>
> @@ -577,7 +591,7 @@ cannot_free:
>   */
>  int remove_mapping(struct address_space *mapping, struct page *page)
>  {
> -        if (__remove_mapping(mapping, page)) {
> +        if (__remove_mapping(mapping, page, false)) {
>                  /*
>                   * Unfreezing the refcount with 1 rather than 2 effectively
>                   * drops the pagecache ref for us without requiring another
> @@ -1047,7 +1061,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                          }
>                  }
>
> -                if (!mapping || !__remove_mapping(mapping, page))
> +                if (!mapping || !__remove_mapping(mapping, page, true))
>                          goto keep_locked;
>
>                  /*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 9bb314577911..3ac830d1b533 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -770,6 +770,8 @@ const char * const vmstat_text[] = {
>          "numa_local",
>          "numa_other",
>  #endif
> +        "workingset_refault",
> +        "workingset_activate",
>          "nr_anon_transparent_hugepages",
>          "nr_free_cma",
>          "nr_dirty_threshold",
> diff --git a/mm/workingset.c b/mm/workingset.c
> new file mode 100644
> index 000000000000..8a6c7cff4923
> --- /dev/null
> +++ b/mm/workingset.c
> @@ -0,0 +1,253 @@
> +/*
> + * Workingset detection
> + *
> + * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
> + */
> +
> +#include <linux/memcontrol.h>
> +#include <linux/writeback.h>
> +#include <linux/pagemap.h>
> +#include <linux/atomic.h>
> +#include <linux/module.h>
> +#include <linux/swap.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +
> +/*
> + * Double CLOCK lists
> + *
> + * Per zone, two clock lists are maintained for file pages: the
> + * inactive and the active list. Freshly faulted pages start out at
> + * the head of the inactive list and page reclaim scans pages from the
> + * tail. Pages that are accessed multiple times on the inactive list
> + * are promoted to the active list, to protect them from reclaim,
> + * whereas active pages are demoted to the inactive list when the
> + * active list grows too big.
> + *
> + *   fault ------------------------+
> + *                                 |
> + *              +--------------+   |            +-------------+
> + *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
> + *              +--------------+                +-------------+    |
> + *                     |                                           |
> + *                     +-------------- promotion ------------------+
> + *
> + *
> + * Access frequency and refault distance
> + *
> + * A workload is thrashing when its pages are frequently used but they
> + * are evicted from the inactive list every time before another access
> + * would have promoted them to the active list.
> + *
> + * In cases where the average access distance between thrashing pages
> + * is bigger than the size of memory there is nothing that can be
> + * done - the thrashing set could never fit into memory under any
> + * circumstance.
> + *
> + * However, the average access distance could be bigger than the
> + * inactive list, yet smaller than the size of memory. In this case,
> + * the set could fit into memory if it weren't for the currently
> + * active pages - which may be used more, hopefully less frequently:
> + *
> + *      +-memory available to cache-+
> + *      |                           |
> + *      +-inactive------+-active----+
> + *  a b | c d e f g h i | J K L M N |
> + *      +---------------+-----------+
> + *
> + * It is prohibitively expensive to accurately track access frequency
> + * of pages. But a reasonable approximation can be made to measure
> + * thrashing on the inactive list, after which refaulting pages can be
> + * activated optimistically to compete with the existing active pages.
> + *
> + * Approximating inactive page access frequency - Observations:
> + *
> + * 1. When a page is accessed for the first time, it is added to the
> + *    head of the inactive list, slides every existing inactive page
> + *    towards the tail by one slot, and pushes the current tail page
> + *    out of memory.
> + *
> + * 2. When a page is accessed for the second time, it is promoted to
> + *    the active list, shrinking the inactive list by one slot. This
> + *    also slides all inactive pages that were faulted into the cache
> + *    more recently than the activated page towards the tail of the
> + *    inactive list.
> + *
> + * Thus:
> + *
> + * 1. The sum of evictions and activations between any two points in
> + *    time indicate the minimum number of inactive pages accessed in
> + *    between.
> + *
> + * 2. Moving one inactive page N page slots towards the tail of the
> + *    list requires at least N inactive page accesses.
> + *
> + * Combining these:
> + *
> + * 1. When a page is finally evicted from memory, the number of
> + *    inactive pages accessed while the page was in cache is at least
> + *    the number of page slots on the inactive list.
> + *
> + * 2. In addition, measuring the sum of evictions and activations (E)
> + *    at the time of a page's eviction, and comparing it to another
> + *    reading (R) at the time the page faults back into memory tells
> + *    the minimum number of accesses while the page was not cached.
> + *    This is called the refault distance.
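
Just to make the two counter readings concrete for myself, a made-up
numeric example (not from any measurement):

        counter at eviction  E = 1000
        counter at refault   R = 1040
        refault distance R - E =   40

i.e. at least 40 inactive pages were accessed while this page was out
of cache.
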
> + *
> + * Because the first access of the page was the fault and the second
> + * access the refault, we combine the in-cache distance with the
> + * out-of-cache distance to get the complete minimum access distance
> + * of this page:
> + *
> + *        NR_inactive + (R - E)
> + *
> + * And knowing the minimum access distance of a page, we can easily
> + * tell if the page would be able to stay in cache assuming all page
> + * slots in the cache were available:
> + *
> + *        NR_inactive + (R - E) <= NR_inactive + NR_active
> + *
> + * which can be further simplified to
> + *
> + *        (R - E) <= NR_active
> + *
> + * Put into words, the refault distance (out-of-cache) can be seen as
> + * a deficit in inactive list space (in-cache). If the inactive list
> + * had (R - E) more page slots, the page would not have been evicted
> + * in between accesses, but activated instead. And on a full system,
> + * the only thing eating into inactive list space is active pages.
> + *
> + *
> + * Activating refaulting pages
> + *
> + * All that is known about the active list is that the pages have been
> + * accessed more than once in the past. This means that at any given
> + * time there is actually a good chance that pages on the active list
> + * are no longer in active use.
> + *
> + * So when a refault distance of (R - E) is observed and there are at
> + * least (R - E) active pages, the refaulting page is activated
> + * optimistically in the hope that (R - E) active pages are actually
> + * used less frequently than the refaulting page - or even not used at
> + * all anymore.
> + *
> + * If this is wrong and demotion kicks in, the pages which are truly
> + * used more frequently will be reactivated while the less frequently
> + * used ones will be evicted from memory.
> + *
> + * But if this is right, the stale pages will be pushed out of memory
> + * and the used pages get to stay in cache.
> + *
> + *
> + * Implementation
> + *
> + * For each zone's file LRU lists, a counter for inactive evictions
> + * and activations is maintained (zone->inactive_age).
> + *
> + * On eviction, a snapshot of this counter (along with some bits to
> + * identify the zone) is stored in the now empty page cache radix tree
> + * slot of the evicted page. This is called a shadow entry.
> + *
> + * On cache misses for which there are shadow entries, an eligible
> + * refault distance will immediately activate the refaulting page.
> + */
> +
> +static void *pack_shadow(unsigned long eviction, struct zone *zone)
> +{
> +        eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
> +        eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
> +        eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
> +
> +        return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
> +}
> +
> +static void unpack_shadow(void *shadow,
> +                          struct zone **zone,
> +                          unsigned long *distance)
> +{
> +        unsigned long entry = (unsigned long)shadow;
> +        unsigned long eviction;
> +        unsigned long refault;
> +        unsigned long mask;
> +        int zid, nid;
> +
> +        entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> +        zid = entry & ((1UL << ZONES_SHIFT) - 1);
> +        entry >>= ZONES_SHIFT;
> +        nid = entry & ((1UL << NODES_SHIFT) - 1);
> +        entry >>= NODES_SHIFT;
> +        eviction = entry;
> +
> +        *zone = NODE_DATA(nid)->node_zones + zid;
> +
> +        refault = atomic_long_read(&(*zone)->inactive_age);
> +        mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
> +                        RADIX_TREE_EXCEPTIONAL_SHIFT);
> +        /*
> +         * The unsigned subtraction here gives an accurate distance
> +         * across inactive_age overflows in most cases.
> +         *
> +         * There is a special case: usually, shadow entries have a
> +         * short lifetime and are either refaulted or reclaimed along
> +         * with the inode before they get too old. But it is not
> +         * impossible for the inactive_age to lap a shadow entry in
> +         * the field, which can then result in a false small
> +         * refault distance, leading to a false activation should this
> +         * old entry actually refault again. However, earlier kernels
> +         * used to deactivate unconditionally with *every* reclaim
> +         * invocation for the longest time, so the occasional
> +         * inappropriate activation leading to pressure on the active
> +         * list is not a problem.
> +         */
> +        *distance = (refault - eviction) & mask;
> +}
> +
> +/**
> + * workingset_eviction - note the eviction of a page from memory
> + * @mapping: address space the page was backing
> + * @page: the page being evicted
> + *
> + * Returns a shadow entry to be stored in @mapping->page_tree in place
> + * of the evicted @page so that a later refault can be detected.
> + */
> +void *workingset_eviction(struct address_space *mapping, struct page *page)
> +{
> +        struct zone *zone = page_zone(page);
> +        unsigned long eviction;
> +
> +        eviction = atomic_long_inc_return(&zone->inactive_age);
> +        return pack_shadow(eviction, zone);
> +}
> +
> +/**
> + * workingset_refault - evaluate the refault of a previously evicted page
> + * @shadow: shadow entry of the evicted page
> + *
> + * Calculates and evaluates the refault distance of the previously
> + * evicted page in the context of the zone it was allocated in.
> + *
> + * Returns %true if the page should be activated, %false otherwise.
> + */
> +bool workingset_refault(void *shadow)
> +{
> +        unsigned long refault_distance;
> +        struct zone *zone;
> +
> +        unpack_shadow(shadow, &zone, &refault_distance);
> +        inc_zone_state(zone, WORKINGSET_REFAULT);
> +
> +        if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> +                inc_zone_state(zone, WORKINGSET_ACTIVATE);
> +                return true;
> +        }
> +        return false;
> +}
> +
> +/**
> + * workingset_activation - note a page activation
> + * @page: page that is being activated
> + */
> +void workingset_activation(struct page *page)
> +{
> +        atomic_long_inc(&page_zone(page)->inactive_age);
> +}
> --
> 1.8.4.2

-- 
Kind regards,
Minchan Kim