+ mm-vmstat-fix-up-zone-state-accounting.patch added to -mm tree

Subject: + mm-vmstat-fix-up-zone-state-accounting.patch added to -mm tree
To: hannes@xxxxxxxxxxx,aarcange@xxxxxxxxxx,bob.liu@xxxxxxxxxx,david@xxxxxxxxxxxxx,gthelen@xxxxxxxxxx,hch@xxxxxxxxxxxxx,hughd@xxxxxxxxxx,jack@xxxxxxx,klamm@xxxxxxxxxxxxxx,kosaki.motohiro@xxxxxxxxxxxxxx,metin@xxxxxxxxxxxxx,mgorman@xxxxxxx,minchan.kim@xxxxxxxxx,ozgun@xxxxxxxxxxxxx,peterz@xxxxxxxxxxxxx,riel@xxxxxxxxxx,rmallon@xxxxxxxxx,semenzato@xxxxxxxxxx,tj@xxxxxxxxxx,vbabka@xxxxxxx,walken@xxxxxxxxxx
From: akpm@xxxxxxxxxxxxxxxxxxxx
Date: Tue, 04 Feb 2014 15:14:39 -0800


The patch titled
     Subject: mm: vmstat: fix UP zone state accounting
has been added to the -mm tree.  Its filename is
     mm-vmstat-fix-up-zone-state-accounting.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vmstat-fix-up-zone-state-accounting.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmstat-fix-up-zone-state-accounting.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Subject: mm: vmstat: fix UP zone state accounting

	Summary

The VM maintains cached filesystem pages on two types of lists.  One list
holds the pages recently faulted into the cache, while the other holds
pages that have been referenced repeatedly on that first list.  The idea
is to prefer reclaiming young pages over those that have proven to benefit
from caching in the past.  We call the recently used list the "inactive
list" and the frequently used list the "active list".
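
For reference, this split shows up directly in the per-zone LRU
bookkeeping; a simplified sketch, paraphrased from include/linux/mmzone.h
of this era with most details trimmed:

	enum lru_list {
		LRU_INACTIVE_ANON,
		LRU_ACTIVE_ANON,
		LRU_INACTIVE_FILE,	/* recently faulted-in file pages */
		LRU_ACTIVE_FILE,	/* file pages referenced while inactive */
		LRU_UNEVICTABLE,
		NR_LRU_LISTS
	};

	struct lruvec {
		/* one page list per LRU above; reclaim scans their tails */
		struct list_head lists[NR_LRU_LISTS];
		struct zone_reclaim_stat reclaim_stat;
	};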

Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used pages
and the ability to *detect* frequently used pages.  This means that
working set changes bigger than half of cache memory go undetected and
thrash indefinitely, whereas working sets bigger than half of cache memory
are unprotected against used-once streams that don't even need caching.
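
The 1:1 goal is enforced at deactivation time: reclaim refills the
inactive file list from the active one only while the active list is the
bigger of the two.  A simplified sketch of that check, paraphrased from
mm/vmscan.c of this era (not verbatim):

	static int inactive_file_is_low(struct zone *zone)
	{
		unsigned long inactive = zone_page_state(zone, NR_INACTIVE_FILE);
		unsigned long active = zone_page_state(zone, NR_ACTIVE_FILE);

		/* deactivate active pages only while active > inactive */
		return active > inactive;
	}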

This happens on file servers and media streaming servers, where the
popular files and file sections change over time.  Even though the
individual files might be smaller than half of memory, concurrent access
to many of them may still result in their inter-reference distance being
greater than half of memory.  It's also been reported as a problem on
database workloads that switch back and forth between tables that are
bigger than half of memory.  In these cases the VM never recognizes the
new working set and will, for the remainder of the workload, thrash disk
data that could easily live in memory.

Historically, every reclaim scan of the inactive list also took a smaller
number of pages from the tail of the active list and moved them to the
head of the inactive list.  This model gave established working sets more
grace time in the face of temporary use-once streams, but ultimately was
not significantly better than a FIFO policy and still thrashed cache based
on eviction speed, rather than actual demand for cache.

This series solves the problem by maintaining a history of pages evicted
from the inactive list, enabling the VM to detect frequently used pages
regardless of inactive list size and facilitate working set transitions.
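
One way such an eviction history can be used (a rough sketch with made-up
names, not code taken verbatim from this series) is to leave a snapshot of
a global eviction counter behind for every evicted page and, on refault,
compare the distance against the size of the active list:

	/* Rough sketch, hypothetical names; not the series' implementation. */
	static unsigned long file_evictions;	/* bumped per evicted file page */

	/* On eviction, the page's cache slot keeps the current counter. */
	static unsigned long workingset_eviction_cookie(void)
	{
		return ++file_evictions;
	}

	/* On refault, check how much eviction activity has passed since. */
	static bool workingset_refault_is_active(unsigned long cookie,
						 unsigned long nr_active_file)
	{
		unsigned long distance = file_evictions - cookie;

		/*
		 * The page would still be resident had the inactive list
		 * been nr_active_file entries bigger; treat the refault
		 * as a working set access and activate the page.
		 */
		return distance <= nr_active_file;
	}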

	Tests

The reported database workload is easily demonstrated on an 8G machine with
two 6G filesets.  This fio workload operates on one set first, then
switches to the other.  The VM should obviously always cache the set that
the workload is currently using.

This test is based on a problem encountered by Citus Data customers:
http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data

unpatched:
db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s, maxb=885559KB/s, mint= 113672msec, maxt= 113672msec
db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s, maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec
sdb: ios=835750/4, merge=2/1, ticks=4659739/60016, in_queue=4719203, util=98.92%

real    27m15.541s
user    0m19.059s
sys     0m51.459s

patched:
db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s, maxb=877783KB/s, mint=114679msec, maxt=114679msec
db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s, maxb=397449KB/s, mint=253273msec, maxt=253273msec
sdb: ios=170587/4, merge=2/1, ticks=954910/61123, in_queue=1015923, util=90.40%

real    6m8.630s
user    0m14.714s
sys     0m31.233s

As can be seen, the unpatched kernel simply never adapts to the working
set change and db2 is stuck indefinitely at secondary storage speed.  The
patched kernel needs 2-3 iterations over db2 before it replaces db1 and
reaches full memory speed.  Given the unbounded negative effect of the
existing VM behavior, these patches should be considered correctness fixes
rather than performance optimizations.

Another test resembles a fileserver or streaming server workload, where
data in excess of memory size is accessed at different frequencies.  There
is very hot data accessed at a high frequency.  Machines should be fitted
so that the hot set of such a workload can be fully cached, or all bets are
off.  Then there is a very big (compared to available memory) set of data
that is used-once or at a very low frequency; this is what drives the
inactive list and does not really benefit from caching.  Lastly, there is
a big set of warm data in between that is accessed at medium frequencies
and benefits from caching the pages between the first and last streamer of
each burst.

unpatched:
 hot: READ: io=128000MB, aggrb=160693KB/s, minb=160693KB/s, maxb=160693KB/s, mint=815665msec, maxt=815665msec
warm: READ: io= 81920MB, aggrb=109853KB/s, minb= 27463KB/s, maxb= 29244KB/s, mint=717110msec, maxt=763617msec
cold: READ: io= 30720MB, aggrb= 35245KB/s, minb= 35245KB/s, maxb= 35245KB/s, mint=892530msec, maxt=892530msec
 sdb: ios=797960/4, merge=11763/1, ticks=4307910/796, in_queue=4308380, util=100.00%

patched:
 hot: READ: io=128000MB, aggrb=160678KB/s, minb=160678KB/s, maxb=160678KB/s, mint=815740msec, maxt=815740msec
warm: READ: io= 81920MB, aggrb=147747KB/s, minb= 36936KB/s, maxb= 40960KB/s, mint=512000msec, maxt=567767msec
cold: READ: io= 30720MB, aggrb= 40960KB/s, minb= 40960KB/s, maxb= 40960KB/s, mint=768000msec, maxt=768000msec
 sdb: ios=596514/4, merge=9341/1, ticks=2395362/997, in_queue=2396484, util=79.18%

In both kernels, the hot set is propagated to the active list and then
served from cache.

In both kernels, the beginning of the warm set is propagated to the active
list as well, but in the unpatched case the active list eventually takes
up half of memory and no new pages from the warm set get activated,
despite repeated access, and despite most of the active list soon being
stale.  The patched kernel, on the other hand, detects the thrashing and
manages to keep this cache window rolling through the data set.  This
frees up enough IO bandwidth that the cold set is served at full speed as
well and disk utilization even drops by 20%.

For reference, this same test was performed with the traditional demotion
mechanism, where deactivation is coupled to inactive list reclaim. 
However, this had the same outcome as the unpatched kernel: while the warm
set does indeed get activated continuously, it is forced out of the active
list by inactive list pressure, which is dictated primarily by the
unrelated cold set.  The warm set is evicted before subsequent streamers
can benefit from it, even though there would be enough space available to
cache the pages of interest.

	Costs

Page reclaim used to shrink the radix trees, but now the tree nodes are
reused for shadow entries, whose cost depends heavily on the page
cache access patterns.  However, with workloads that maintain spatial or
temporal locality, the shadow entries are either refaulted quickly or
reclaimed along with the inode object itself.  Workloads that will
experience a memory cost increase are those that don't really benefit from
caching in the first place.

A more predictable alternative would be a fixed-cost separate pool of
shadow entries, but this would incur a relatively higher memory cost for
well-behaved workloads for the benefit of corner cases.  It would also make
the shadow entry lookup more costly compared to storing them directly in
the cache structure.

	Future

To simplify the merging process, this patch set implements thrash
detection only on a global per-zone level for now, but the design is such
that it can be extended to memory cgroups as well.  All we need to do is
store the unique cgroup ID along with the node and zone identifier inside
the eviction cookie to identify the lruvec.
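
A possible encoding (illustrative only; the names and field widths below
are invented) would shoehorn those identifiers plus the eviction counter
into a single unsigned long, so the shadow entry still fits in one radix
tree slot:

	/* Illustrative packing only; names and widths are invented here. */
	#define EV_ZONE_BITS	2	/* assumes at most 4 zones per node */
	#define EV_NODE_BITS	10	/* assumes fewer than 1024 nodes */
	#define EV_MEMCG_BITS	16	/* assumes a 16-bit cgroup ID */

	static unsigned long pack_eviction_cookie(unsigned int memcgid,
						  unsigned int nid,
						  unsigned int zid,
						  unsigned long evictions)
	{
		unsigned long cookie = evictions;

		/* counter in the high bits, identifiers in the low bits */
		cookie = (cookie << EV_MEMCG_BITS) | memcgid;
		cookie = (cookie << EV_NODE_BITS) | nid;
		cookie = (cookie << EV_ZONE_BITS) | zid;

		return cookie;
	}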

Right now we have a fixed ratio (50:50) between the inactive and active lists,
but we already have complaints about working sets exceeding half of memory
being pushed out of the cache by simple streaming in the background. 
Ultimately, we want to adjust this ratio and allow for a much smaller
inactive list.  These patches are an essential step in this direction
because they decouple the VM's ability to detect working set changes from
the inactive list size.  This would allow us to base the inactive list
size on the combined readahead window size, for example, and potentially
protect a much bigger working set.

It's also a big step towards activating pages with a reuse distance larger
than memory, as long as they are the most frequently used pages in the
workload.  This will require knowing more about the access frequency of
active pages than we measure right now, so it is also deferred beyond this
series.

Another possibility opened up by having thrashing information is to revisit
the idea of local reclaim in the form of zero-config memory control
groups.  Instead of having allocating tasks go straight to global reclaim,
they could try to reclaim the pages in the memcg they are part of first as
long as the group is not thrashing.  This would allow a user to drop, e.g.,
a back-up job into an otherwise unconfigured memcg, and it would only inflate
(and possibly do global reclaim) until it has enough memory to do proper
readahead.  But once it reaches that point and stops thrashing, it would
just recycle its own used-once pages without kicking out the cache of any
other tasks in the system more than necessary.



This patch (of 10):

Fengguang Wu's build testing spotted problems with inc_zone_state()
and dec_zone_state() on UP configurations in out-of-tree patches.

inc_zone_state() is declared but not defined; dec_zone_state() is
missing entirely.

Just like with *_zone_page_state(), they can be defined like their
preemption-unsafe counterparts on UP.

Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Bob Liu <bob.liu@xxxxxxxxxx>
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Cc: Dave Chinner <david@xxxxxxxxxxxxx>
Cc: Greg Thelen <gthelen@xxxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Jan Kara <jack@xxxxxxx>
Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
Cc: Luigi Semenzato <semenzato@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Metin Doslu <metin@xxxxxxxxxxxxx>
Cc: Michel Lespinasse <walken@xxxxxxxxxx>
Cc: Minchan Kim <minchan.kim@xxxxxxxxx>
Cc: Ozgun Erdogan <ozgun@xxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Roman Gushchin <klamm@xxxxxxxxxxxxxx>
Cc: Ryan Mallon <rmallon@xxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 include/linux/vmstat.h |   29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff -puN include/linux/vmstat.h~mm-vmstat-fix-up-zone-state-accounting include/linux/vmstat.h
--- a/include/linux/vmstat.h~mm-vmstat-fix-up-zone-state-accounting
+++ a/include/linux/vmstat.h
@@ -179,8 +179,6 @@ extern void zone_statistics(struct zone
 #define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d)
 #define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d))
 
-extern void inc_zone_state(struct zone *, enum zone_stat_item);
-
 #ifdef CONFIG_SMP
 void __mod_zone_page_state(struct zone *, enum zone_stat_item item, int);
 void __inc_zone_page_state(struct page *, enum zone_stat_item);
@@ -216,24 +214,12 @@ static inline void __mod_zone_page_state
 	zone_page_state_add(delta, zone, item);
 }
 
-static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
-{
-	atomic_long_inc(&zone->vm_stat[item]);
-	atomic_long_inc(&vm_stat[item]);
-}
-
 static inline void __inc_zone_page_state(struct page *page,
 			enum zone_stat_item item)
 {
 	__inc_zone_state(page_zone(page), item);
 }
 
-static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
-{
-	atomic_long_dec(&zone->vm_stat[item]);
-	atomic_long_dec(&vm_stat[item]);
-}
-
 static inline void __dec_zone_page_state(struct page *page,
 			enum zone_stat_item item)
 {
@@ -248,6 +234,21 @@ static inline void __dec_zone_page_state
 #define dec_zone_page_state __dec_zone_page_state
 #define mod_zone_page_state __mod_zone_page_state
 
+static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+	atomic_long_inc(&zone->vm_stat[item]);
+	atomic_long_inc(&vm_stat[item]);
+}
+
+static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+	atomic_long_dec(&zone->vm_stat[item]);
+	atomic_long_dec(&vm_stat[item]);
+}
+
+#define inc_zone_state __inc_zone_state
+#define dec_zone_state __dec_zone_state
+
 #define set_pgdat_percpu_threshold(pgdat, callback) { }
 
 static inline void refresh_cpu_vm_stats(int cpu) { }
_

Patches currently in -mm which might be from hannes@xxxxxxxxxxx are

mm-vmscan-respect-numa-policy-mask-when-shrinking-slab-on-direct-reclaim.patch
mm-vmscan-move-call-to-shrink_slab-to-shrink_zones.patch
mm-vmscan-remove-shrink_control-arg-from-do_try_to_free_pages.patch
mm-vmstat-fix-up-zone-state-accounting.patch
fs-cachefiles-use-add_to_page_cache_lru.patch
lib-radix-tree-radix_tree_delete_item.patch
mm-shmem-save-one-radix-tree-lookup-when-truncating-swapped-pages.patch
mm-filemap-move-radix-tree-hole-searching-here.patch
mm-fs-prepare-for-non-page-entries-in-page-cache-radix-trees.patch
mm-fs-store-shadow-entries-in-page-cache.patch
mm-thrash-detection-based-file-cache-sizing.patch
lib-radix_tree-tree-node-interface.patch
mm-keep-page-cache-radix-tree-nodes-in-check.patch
linux-next.patch
debugging-keep-track-of-page-owners.patch

--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



