---
From: Andrew Bresticker <abrestic@xxxxxxxxxx>
Date: Thu, 14 Jul 2011 17:56:48 -0700
Subject: [PATCH] vmscan: Track number of pages written back during page reclaim.
This tracks the number of pages written out during page reclaim in
memory.vmscan_stat, broken down by file vs. anon and by reclaim context,
in the same way as the existing statistics ("scanned_pages",
"rotated_pages", etc.).
Example output:
$ mkdir /dev/cgroup/memory/1
$ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
$ echo $$ > /dev/cgroup/memory/1/tasks
$ dd if=/dev/urandom of=file_20g bs=4096 count=524288
$ cat /dev/cgroup/memory/1/memory.vmscan_stat
...
written_pages_by_limit 36
written_anon_pages_by_limit 0
written_file_pages_by_limit 36
...
written_pages_by_limit_under_hierarchy 28
written_anon_pages_by_limit_under_hierarchy 0
written_file_pages_by_limit_under_hierarchy 28
Signed-off-by: Andrew Bresticker <abrestic@xxxxxxxxxx>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 12 ++++++++++++
mm/vmscan.c | 10 +++++++---
3 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4b49edf..4be907e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -46,6 +46,7 @@ struct memcg_scanrecord {
unsigned long nr_scanned[2]; /* the number of scanned pages */
unsigned long nr_rotated[2]; /* the number of rotated pages */
unsigned long nr_freed[2]; /* the number of freed pages */
+ unsigned long nr_written[2]; /* the number of pages written back */
unsigned long elapsed; /* nsec of time elapsed while scanning */
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9bb6e93..5ec2aa3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -221,6 +221,9 @@ enum {
FREED,
FREED_ANON,
FREED_FILE,
+ WRITTEN,
+ WRITTEN_ANON,
+ WRITTEN_FILE,
ELAPSED,
NR_SCANSTATS,
};
@@ -241,6 +244,9 @@ const char *scanstat_string[NR_SCANSTATS] = {
"freed_pages",
"freed_anon_pages",
"freed_file_pages",
+ "written_pages",
+ "written_anon_pages",
+ "written_file_pages",
"elapsed_ns",
};
#define SCANSTAT_WORD_LIMIT "_by_limit"
@@ -1682,6 +1688,10 @@ static void __mem_cgroup_record_scanstat(unsigned long *stats,
stats[FREED_ANON] += rec->nr_freed[0];
stats[FREED_FILE] += rec->nr_freed[1];
+ stats[WRITTEN] += rec->nr_written[0] + rec->nr_written[1];
+ stats[WRITTEN_ANON] += rec->nr_written[0];
+ stats[WRITTEN_FILE] += rec->nr_written[1];
+
stats[ELAPSED] += rec->elapsed;
}
@@ -1794,6 +1804,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
rec.nr_rotated[1] = 0;
rec.nr_freed[0] = 0;
rec.nr_freed[1] = 0;
+ rec.nr_written[0] = 0;
+ rec.nr_written[1] = 0;
rec.elapsed = 0;
/* we use swappiness of local cgroup */
if (check_soft) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8fb1abd..f73b96e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -719,7 +719,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
*/
static unsigned long shrink_page_list(struct list_head *page_list,
struct zone *zone,
- struct scan_control *sc)
+ struct scan_control *sc, int file)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -727,6 +727,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
unsigned long nr_dirty = 0;
unsigned long nr_congested = 0;
unsigned long nr_reclaimed = 0;
+ unsigned long nr_written = 0;
cond_resched();
@@ -840,6 +841,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
case PAGE_ACTIVATE:
goto activate_locked;
case PAGE_SUCCESS:
+ nr_written++;
if (PageWriteback(page))
goto keep_lumpy;
if (PageDirty(page))
@@ -958,6 +960,8 @@ keep_lumpy:
free_page_list(&free_pages);
list_splice(&ret_pages, page_list);
+ if (!scanning_global_lru(sc))
+ sc->memcg_record->nr_written[file] += nr_written;
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
@@ -1463,7 +1467,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
spin_unlock_irq(&zone->lru_lock);
- nr_reclaimed = shrink_page_list(&page_list, zone, sc);
+ nr_reclaimed = shrink_page_list(&page_list, zone, sc, file);
if (!scanning_global_lru(sc))
sc->memcg_record->nr_freed[file] += nr_reclaimed;
@@ -1471,7 +1475,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
set_reclaim_mode(priority, sc, true);
- nr_reclaimed += shrink_page_list(&page_list, zone, sc);
+ nr_reclaimed += shrink_page_list(&page_list, zone, sc, file);
}
local_irq_disable();
--
1.7.3.1
Thanks,
Andrew
On Wed, Jul 13, 2011 at 5:02 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
On Tue, 12 Jul 2011 16:02:02 -0700
Andrew Bresticker <abrestic@xxxxxxxxxx> wrote:
> On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>
> >
> > This patch is onto mmotm-0710... got bigger than expected ;(
> > ==
> > [PATCH] add memory.vmscan_stat
> >
> > commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim in..."
> > says it adds scanning stats to memory.stat file. But it doesn't because
> > we thought we needed to reach a consensus on such new APIs.
> >
> > This patch is a trial to add memory.scan_stat. This shows
> > - the number of scanned pages(total, anon, file)
> > - the number of rotated pages(total, anon, file)
> > - the number of freed pages(total, anon, file)
> > - the elapsed time (including sleep/pause time)
> >
> > for both of direct/soft reclaim.
> >
> > The biggest difference from Ying's original version is that this file
> > can be reset by a write, as
> >
> > # echo 0 > ...../memory.scan_stat
> >
> > Example of output is here. This is a result after make -j 6 kernel
> > under 300M limit.
> >
> > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
> > scanned_pages_by_limit 9471864
> > scanned_anon_pages_by_limit 6640629
> > scanned_file_pages_by_limit 2831235
> > rotated_pages_by_limit 4243974
> > rotated_anon_pages_by_limit 3971968
> > rotated_file_pages_by_limit 272006
> > freed_pages_by_limit 2318492
> > freed_anon_pages_by_limit 962052
> > freed_file_pages_by_limit 1356440
> > elapsed_ns_by_limit 351386416101
> > scanned_pages_by_system 0
> > scanned_anon_pages_by_system 0
> > scanned_file_pages_by_system 0
> > rotated_pages_by_system 0
> > rotated_anon_pages_by_system 0
> > rotated_file_pages_by_system 0
> > freed_pages_by_system 0
> > freed_anon_pages_by_system 0
> > freed_file_pages_by_system 0
> > elapsed_ns_by_system 0
> > scanned_pages_by_limit_under_hierarchy 9471864
> > scanned_anon_pages_by_limit_under_hierarchy 6640629
> > scanned_file_pages_by_limit_under_hierarchy 2831235
> > rotated_pages_by_limit_under_hierarchy 4243974
> > rotated_anon_pages_by_limit_under_hierarchy 3971968
> > rotated_file_pages_by_limit_under_hierarchy 272006
> > freed_pages_by_limit_under_hierarchy 2318492
> > freed_anon_pages_by_limit_under_hierarchy 962052
> > freed_file_pages_by_limit_under_hierarchy 1356440
> > elapsed_ns_by_limit_under_hierarchy 351386416101
> > scanned_pages_by_system_under_hierarchy 0
> > scanned_anon_pages_by_system_under_hierarchy 0
> > scanned_file_pages_by_system_under_hierarchy 0
> > rotated_pages_by_system_under_hierarchy 0
> > rotated_anon_pages_by_system_under_hierarchy 0
> > rotated_file_pages_by_system_under_hierarchy 0
> > freed_pages_by_system_under_hierarchy 0
> > freed_anon_pages_by_system_under_hierarchy 0
> > freed_file_pages_by_system_under_hierarchy 0
> > elapsed_ns_by_system_under_hierarchy 0
> >
> >
> > The xxxx_under_hierarchy entries are for hierarchy management.
> >
> > This will be useful for further memcg development and needs to be in
> > place before we do any complicated rework of LRU/softlimit
> > management.
> >
> > This patch adds a new struct memcg_scanrecord to the scan_control struct.
> > sc->nr_scanned et al. are not designed for exporting information; for
> > example, nr_scanned is reset frequently and is incremented by 2 when
> > scanning mapped pages.
> >
> > To avoid complexity, I added a new param to scan_control which is for
> > exporting the scanning statistics.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> >
> > Changelog:
> > - renamed as vmscan_stat
> > - handle file/anon
> > - added "rotated"
> > - changed names of param in vmscan_stat.
> > ---
> > Documentation/cgroups/memory.txt | 85 +++++++++++++++++++
> > include/linux/memcontrol.h | 19 ++++
> > include/linux/swap.h | 6 -
> > mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++--
> > mm/vmscan.c | 39 +++++++-
> > 5 files changed, 303 insertions(+), 18 deletions(-)
> >
> > Index: mmotm-0710/Documentation/cgroups/memory.txt
> > ===================================================================
> > --- mmotm-0710.orig/Documentation/cgroups/memory.txt
> > +++ mmotm-0710/Documentation/cgroups/memory.txt
> > @@ -380,7 +380,7 @@ will be charged as a new owner of it.
> >
> > 5.2 stat file
> >
> > -memory.stat file includes following statistics
> > +5.2.1 memory.stat file includes following statistics
> >
> > # per-memory cgroup local status
> > cache - # of bytes of page cache memory.
> > @@ -438,6 +438,89 @@ Note:
> > file_mapped is accounted only when the memory cgroup is owner of
> > page
> > cache.)
> >
> > +5.2.2 memory.vmscan_stat
> > +
> > +memory.vmscan_stat includes statistics about memory scanning, freeing,
> > +and reclaiming. The statistics cover activity since memory cgroup
> > +creation and can be reset to 0 by writing 0 as
> > +
> > + # echo 0 > ../memory.vmscan_stat
> > +
> > +This file contains the following statistics.
> > +
> > +[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy]
> > +[param]_elapsed_ns_by_[reason]_[under_hierarchy]
> > +
> > +For example,
> > +
> > + scanned_file_pages_by_limit indicates the number of file
> > + pages scanned by vmscan.
> > +
> > +Now, 3 parameters are supported
> > +
> > + scanned - the number of pages scanned by vmscan
> > + rotated - the number of pages activated at vmscan
> > + freed - the number of pages freed by vmscan
> > +
> > +If "rotated" is high against scanned/freed, the memcg seems busy.
> > +
> > +Now, 2 reasons are supported
> > +
> > + limit - the memory cgroup's limit
> > + system - global memory pressure + softlimit
> > + (global memory pressure not under softlimit is not handled now)
> > +
> > +When under_hierarchy is appended at the tail, the number indicates the
> > +total for the memcg and all of its children.
> > +
> > +elapsed_ns is the elapsed time in nanoseconds. It may include sleep time
> > +and does not indicate CPU usage, so please take this as just showing
> > +latency.
> > +
> > +Here is an example.
> > +
> > +# cat /cgroup/memory/A/memory.vmscan_stat
> > +scanned_pages_by_limit 9471864
> > +scanned_anon_pages_by_limit 6640629
> > +scanned_file_pages_by_limit 2831235
> > +rotated_pages_by_limit 4243974
> > +rotated_anon_pages_by_limit 3971968
> > +rotated_file_pages_by_limit 272006
> > +freed_pages_by_limit 2318492
> > +freed_anon_pages_by_limit 962052
> > +freed_file_pages_by_limit 1356440
> > +elapsed_ns_by_limit 351386416101
> > +scanned_pages_by_system 0
> > +scanned_anon_pages_by_system 0
> > +scanned_file_pages_by_system 0
> > +rotated_pages_by_system 0
> > +rotated_anon_pages_by_system 0
> > +rotated_file_pages_by_system 0
> > +freed_pages_by_system 0
> > +freed_anon_pages_by_system 0
> > +freed_file_pages_by_system 0
> > +elapsed_ns_by_system 0
> > +scanned_pages_by_limit_under_hierarchy 9471864
> > +scanned_anon_pages_by_limit_under_hierarchy 6640629
> > +scanned_file_pages_by_limit_under_hierarchy 2831235
> > +rotated_pages_by_limit_under_hierarchy 4243974
> > +rotated_anon_pages_by_limit_under_hierarchy 3971968
> > +rotated_file_pages_by_limit_under_hierarchy 272006
> > +freed_pages_by_limit_under_hierarchy 2318492
> > +freed_anon_pages_by_limit_under_hierarchy 962052
> > +freed_file_pages_by_limit_under_hierarchy 1356440
> > +elapsed_ns_by_limit_under_hierarchy 351386416101
> > +scanned_pages_by_system_under_hierarchy 0
> > +scanned_anon_pages_by_system_under_hierarchy 0
> > +scanned_file_pages_by_system_under_hierarchy 0
> > +rotated_pages_by_system_under_hierarchy 0
> > +rotated_anon_pages_by_system_under_hierarchy 0
> > +rotated_file_pages_by_system_under_hierarchy 0
> > +freed_pages_by_system_under_hierarchy 0
> > +freed_anon_pages_by_system_under_hierarchy 0
> > +freed_file_pages_by_system_under_hierarchy 0
> > +elapsed_ns_by_system_under_hierarchy 0
> > +
> > 5.3 swappiness
> >
> > Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups
> > only.
> > Index: mmotm-0710/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-0710.orig/include/linux/memcontrol.h
> > +++ mmotm-0710/include/linux/memcontrol.h
> > @@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_
> > struct mem_cgroup *mem_cont,
> > int active, int file);
> >
> > +struct memcg_scanrecord {
> > + struct mem_cgroup *mem; /* scanned memory cgroup */
> > + struct mem_cgroup *root; /* scan target hierarchy root */
> > + int context; /* scanning context (see memcontrol.c) */
> > + unsigned long nr_scanned[2]; /* the number of scanned pages */
> > + unsigned long nr_rotated[2]; /* the number of rotated pages */
> > + unsigned long nr_freed[2]; /* the number of freed pages */
> > + unsigned long elapsed; /* nsec of time elapsed while scanning */
> > +};
> > +
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > /*
> > * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > @@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st
> > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> > struct task_struct *p);
> >
> > +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> > +					gfp_t gfp_mask, bool noswap,
> > +					struct memcg_scanrecord *rec);
> > +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > +					gfp_t gfp_mask, bool noswap,
> > +					struct zone *zone,
> > +					struct memcg_scanrecord *rec,
> > +					unsigned long *nr_scanned);
> > +
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > extern int do_swap_account;
> > #endif
> > Index: mmotm-0710/include/linux/swap.h
> > ===================================================================
> > --- mmotm-0710.orig/include/linux/swap.h
> > +++ mmotm-0710/include/linux/swap.h
> > @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
> > /* linux/mm/vmscan.c */
> > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int
> > order,
> > gfp_t gfp_mask, nodemask_t *mask);
> > -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> > -					gfp_t gfp_mask, bool noswap);
> > -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > -					gfp_t gfp_mask, bool noswap,
> > -					struct zone *zone,
> > -					unsigned long *nr_scanned);
> > extern int __isolate_lru_page(struct page *page, int mode, int file);
> > extern unsigned long shrink_all_memory(unsigned long nr_pages);
> > extern int vm_swappiness;
> > Index: mmotm-0710/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0710.orig/mm/memcontrol.c
> > +++ mmotm-0710/mm/memcontrol.c
> > @@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list {
> > static void mem_cgroup_threshold(struct mem_cgroup *mem);
> > static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
> >
> > +enum {
> > + SCAN_BY_LIMIT,
> > + SCAN_BY_SYSTEM,
> > + NR_SCAN_CONTEXT,
> > + SCAN_BY_SHRINK, /* not recorded now */
> > +};
> > +
> > +enum {
> > + SCAN,
> > + SCAN_ANON,
> > + SCAN_FILE,
> > + ROTATE,
> > + ROTATE_ANON,
> > + ROTATE_FILE,
> > + FREED,
> > + FREED_ANON,
> > + FREED_FILE,
> > + ELAPSED,
> > + NR_SCANSTATS,
> > +};
> > +
> > +struct scanstat {
> > + spinlock_t lock;
> > + unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> > + unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> > +};
> >
>
> I'm working on a similar effort with Ying here at Google and so far we've
> been using per-cpu counters for these statistics instead of spin-lock
> protected counters. Clearly the spin-lock protected counters have less
> memory overhead and make reading the stat file faster, but our concern is
> that this method is inconsistent with the other memory stat files such as
> /proc/vmstat and /dev/cgroup/memory/.../memory.stat. Is there any
> particular reason you chose to use spin-lock protected counters instead of
> per-cpu counters?
>
In my experience, if we do "batch" enough, it always works better than a
percpu counter. A percpu counter is effective when batching is difficult.
This patch's implementation does enough batching, and it is much more
coarse-grained than a percpu counter, so this patch is better than percpu.
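To make the batching concrete, here is a minimal userspace C analogue of
the pattern this patch uses (modeled on the memcg_scanrecord and struct
scanstat code quoted in this thread; the names and pthread locking below
are illustrative assumptions, not the mmotm code, and kernel details like
preemption are ignored):

#include <pthread.h>

struct scanstat_demo {
	pthread_mutex_t lock;
	unsigned long scanned[2], freed[2], written[2]; /* [0]=anon [1]=file */
};

struct scanrecord_demo { /* on-stack, private to one reclaim pass */
	unsigned long scanned[2], freed[2], written[2];
};

/* Hot path: called once per page; touches only the private record,
 * so there is no lock and no shared-cacheline traffic. */
static inline void demo_count_page(struct scanrecord_demo *rec, int file,
				   unsigned long freed, unsigned long written)
{
	rec->scanned[file] += 1;
	rec->freed[file] += freed;
	rec->written[file] += written;
}

/* Cold path: one lock round-trip per reclaim pass, no matter how many
 * pages the pass scanned -- this is the "batch". */
static void demo_record_scanstat(struct scanstat_demo *stats,
				 const struct scanrecord_demo *rec)
{
	int i;

	pthread_mutex_lock(&stats->lock);
	for (i = 0; i < 2; i++) {
		stats->scanned[i] += rec->scanned[i];
		stats->freed[i] += rec->freed[i];
		stats->written[i] += rec->written[i];
	}
	pthread_mutex_unlock(&stats->lock);
}

With a whole reclaim pass folded into one locked merge, the lock is taken
orders of magnitude less often than pages are scanned.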
> I've also modified your patch to use per-cpu counters instead of spin-lock
> protected counters. I tested it by doing streaming I/O from a ramdisk:
>
> $ mke2fs /dev/ram1
> $ mkdir /tmp/swapram
> $ mkdir /tmp/swapram/ram1
> $ mount -t ext2 /dev/ram1 /tmp/swapram/ram1
> $ dd if=/dev/urandom of=/tmp/swapram/ram1/file_16m bs=4096 count=4096
> $ mkdir /dev/cgroup/memory/1
> $ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
> $ ./ramdisk_load.sh 7
> $ echo $$ > /dev/cgroup/memory/1/tasks
> $ time for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done
>
> Where ramdisk_load.sh is:
> for ((i=0; i<=$1; i++))
> do
> echo $$ >/dev/cgroup/memory/1/tasks
> for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done &
> done
>
> Surprisingly, the per-cpu counters perform worse than the spin-lock
> protected counters. Over 10 runs of the test above, the per-cpu counters
> were 1.60% slower in both real time and sys time. I'm wondering if you have
> any insight as to why this is. I can provide my diff against your patch if
> necessary.
>
A percpu counter works effectively only when we apply +1/-1 at each change
of the counter. It uses a "batch" threshold to merge the per-cpu value into
the shared counter. I think you used the default "batch" value, but the
scan/rotate/free/elapsed deltas are always larger than "batch", so you just
added memory overhead and an extra branch on top of pure spinlock counters.
Determining this "batch" threshold for a percpu counter is difficult.
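For reference, the add path of a percpu counter looks roughly like the
sketch below (a simplified userspace paraphrase of lib/percpu_counter.c
from kernels of this era; details vary by version, and the real code also
handles preemption and CPU hotplug). When every delta already exceeds
"batch", the test falls through to the locked path on every update, so the
per-cpu layer only adds a compare, extra memory, and a possible cache miss:

#include <pthread.h>

struct pcpu_counter_demo {
	pthread_mutex_t lock;
	long long count;      /* shared, lock-protected total */
	long long *counters;  /* one deferred delta slot per CPU */
};

static void pcpu_counter_add_demo(struct pcpu_counter_demo *fbc, int cpu,
				  long long amount, long long batch)
{
	long long count = fbc->counters[cpu] + amount;

	if (count >= batch || count <= -batch) {
		/* Delta too large to defer: fold it into the shared total
		 * under the lock. If every update is >= batch, this branch
		 * is taken every time and the batching never helps. */
		pthread_mutex_lock(&fbc->lock);
		fbc->count += count;
		fbc->counters[cpu] = 0;
		pthread_mutex_unlock(&fbc->lock);
	} else {
		/* Small delta: park it in the per-cpu slot, lock-free. */
		fbc->counters[cpu] = count;
	}
}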
Thanks,
-Kame