Re: zram, OOM, and speed of allocation

Hi Luigi,

Here is another patch, independent of my previous ones.
With it you can raise /proc/sys/vm/swappiness up to 200 (which means the
VM reclaimer reclaims only anonymous pages), so I hope it lets the swap
device fill up completely while file-backed pages (i.e., code pages) are
protected from eviction.

I hope this patch lets you drop your hacky min_filelist_kbytes patch.
Could you try it and send feedback?
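
Once the patch is applied, raising the limit is just a sysctl write
(echo 200 > /proc/sys/vm/swappiness); for completeness, a minimal C
sketch of mine that does the same thing:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/swappiness", "w");

	if (!f) {
		perror("/proc/sys/vm/swappiness");
		return 1;
	}
	/* 200 == reclaim anonymous pages only, once the patch is applied */
	fprintf(f, "200\n");
	return fclose(f) ? 1 : 0;
}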

From 808cf60675af9731e68e4ae98c8ededef2b42350 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@xxxxxxxxxx>
Date: Mon, 3 Dec 2012 16:21:00 +0900
Subject: [PATCH] mm: raise maximum swappiness to 200

We have always assumed that the cost of swapping out is very high,
but that is not true for a fast device such as swap-over-zram.
Nonetheless, with swappiness capped at 100, we can swap out anon and
page-cache pages in at most a 1:1 ratio. That is not enough to use
the swap device fully, so we hit OOM kills while there is still
plenty of free space in the zram swap device, which is never what
we want.

This patch raises the maximum swappiness to 200 so that reclaim can
swap out much more aggressively.

Cc: Luigi Semenzato <semenzato@xxxxxxxxxx>
Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
---
 kernel/sysctl.c |    3 ++-
 mm/vmscan.c     |    5 +++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 693e0ed..f1dbd9d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -130,6 +130,7 @@ static int __maybe_unused two = 2;
 static int __maybe_unused three = 3;
 static unsigned long one_ul = 1;
 static int one_hundred = 100;
+extern int max_swappiness;
 #ifdef CONFIG_PRINTK
 static int ten_thousand = 10000;
 #endif
@@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &one_hundred,
+		.extra2		= &max_swappiness,
 	},
 #ifdef CONFIG_HUGETLB_PAGE
 	{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 53dcde9..64f3c21 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -53,6 +53,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
+int max_swappiness = 200;
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -1701,11 +1703,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 
 	/*
-	 * With swappiness at 100, anonymous and file have the same priority.
 	 * This scanning priority is essentially the inverse of IO cost.
 	 */
 	anon_prio = vmscan_swappiness(sc);
-	file_prio = 200 - anon_prio;
+	file_prio = max_swappiness - anon_prio;
 
 	/*
 	 * OK, so we have swap space and a fair amount of page cache
-- 
1.7.9.5
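
To see what the patch does to the balance, here is a small userspace
sketch (my illustration, not kernel code) of the anon_prio/file_prio
arithmetic from get_scan_count(), leaving out the recent_rotated/
recent_scanned feedback: at swappiness 200, file_prio drops to 0, so
all of the reclaim pressure lands on the anon lists.

#include <stdio.h>

#define MAX_SWAPPINESS 200	/* mirrors max_swappiness in the patch */

int main(void)
{
	int swappiness;

	for (swappiness = 0; swappiness <= MAX_SWAPPINESS; swappiness += 50) {
		/* same arithmetic as get_scan_count() after the patch */
		int anon_prio = swappiness;
		int file_prio = MAX_SWAPPINESS - anon_prio;

		printf("swappiness=%3d  anon_prio=%3d  file_prio=%3d\n",
		       swappiness, anon_prio, file_prio);
	}
	return 0;
}

At the old default of 60 this gives anon_prio=60 and file_prio=140; at
the old maximum of 100 the two are equal; at 200 the file lists get no
scan priority at all (subject to the rest of the get_scan_count() logic).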


On Thu, Nov 29, 2012 at 11:31:46AM -0800, Luigi Semenzato wrote:
> Oh well, I found the problem, it's laptop_mode.  We keep it on by
> default.  When I turn it off, I can allocate as fast as I can, and no
> OOMs happen until swap is exhausted.
> 
> I don't think this is a desirable behavior even for laptop_mode, so if
> anybody wants to help me debug it (or wants my help in debugging it)
> do let me know.
> 
> Thanks!
> Luigi
> 
> On Thu, Nov 29, 2012 at 10:46 AM, Luigi Semenzato <semenzato@xxxxxxxxxx> wrote:
> > Minchan:
> >
> > I tried your suggestion to move the call to wake_all_kswapd from after
> > "restart:" to after "rebalance:".  The behavior is still similar, but
> > slightly improved.  Here's what I see.
> >
> > Allocating as fast as I can: 1.5 GB of the 3 GB of zram swap are used,
> > then OOM kills happen, and the system ends up with 1 GB of swap used
> > and 2 GB unused.
> >
> > Allocating 10 MB/s: some kills happen when only 1 to 1.5 GB are used,
> > and continue happening while swap fills up.  Eventually swap fills up
> > completely.  This is better than before (could not go past about 1 GB
> > of swap used), but there are too many kills too early.  I would like
> > to see no OOM kills until swap is full or almost full.
> >
> > Allocating 20 MB/s: almost as good as with 10 MB/s, but more kills
> > happen earlier, and not all swap space is used (400 MB free at the
> > end).
> >
> > This is with 200 processes using 20 MB each, and 2:1 compression ratio.
> >
> > So it looks like kswapd is still not aggressive enough in pushing
> > pages out.  What's the best way of changing that?  Play around with
> > the watermarks?
> >
> > Incidentally, I also tried removing the hacky min_filelist_kbytes
> > patch, but, as usual, the system then thrashes so badly that it's
> > impossible to complete any experiment.  I set it to a lower minimum
> > amount of free file pages, 10 MB instead of the 50 MB we use normally,
> > and I could run with some thrashing, but I got the same results.
> >
> > Thanks!
> > Luigi
> >
> >
> > On Wed, Nov 28, 2012 at 4:31 PM, Luigi Semenzato <semenzato@xxxxxxxxxx> wrote:
> >> I am beginning to understand why zram appears to work fine on our x86
> >> systems but not on our ARM systems.  The bottom line is that swapping
> >> doesn't work as I would expect when allocation is "too fast".
> >>
> >> In one of my tests, opening 50 tabs simultaneously in a Chrome browser
> >> on devices with 2 GB of RAM and a zram-disk of 3 GB (uncompressed), I
> >> was observing that on the x86 device all of the zram swap space was
> >> used before OOM kills happened, but on the ARM device I would see OOM
> >> kills when only about 1 GB (out of 3) was swapped out.
> >>
> >> I wrote a simple program to understand this behavior.  The program
> >> (called "hog") allocates memory and fills it with a mix of
> >> incompressible data (from /dev/urandom) and highly compressible data
> >> (1's, just to avoid zero pages) in a given ratio.  The memory is never
> >> touched again.
> >>
> >> It turns out that if I don't limit the allocation speed, I see
> >> premature OOM kills also on the x86 device.  If I limit the allocation
> >> to 10 MB/s, the premature OOM kills stop happening on the x86 device,
> >> but still happen on the ARM device.  If I further limit the allocation
> >> speed to 5 MB/s, the premature OOM kills disappear on the ARM device
> >> as well.
> >>
> >> I have noticed a few time constants in the MM code whose values are
> >> not well explained, and I am wondering if the code is tuned for some
> >> ideal system that doesn't behave like ours (considering, for instance,
> >> that zram is much faster than swapping to a disk device, but also uses
> >> more CPU).  If this is plausible, I am wondering if anybody has
> >> suggestions for changes that I could try out to obtain better behavior
> >> at higher allocation speeds.
> >>
> >> Thanks!
> >> Luigi
> 
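
For anyone else who wants to reproduce Luigi's experiment, here is a
minimal sketch of a hog-style allocator as described above.  The 50/50
data mix, the 1 MB chunk size, and the one-second sleep are my
illustrative assumptions, not details of his actual program:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK	(1 << 20)	/* allocate 1 MB at a time */

/* Fill buf with incompressible bytes from /dev/urandom. */
static int fill_random(int fd, char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t n = read(fd, buf + off, len - off);

		if (n <= 0)
			return -1;
		off += n;
	}
	return 0;
}

int main(int argc, char **argv)
{
	int rate_mb = argc > 1 ? atoi(argv[1]) : 10;	/* MB per second */
	int urandom = open("/dev/urandom", O_RDONLY);
	int i;

	if (urandom < 0) {
		perror("/dev/urandom");
		return 1;
	}

	for (;;) {
		for (i = 0; i < rate_mb; i++) {
			char *p = malloc(CHUNK);

			if (!p)
				return 0;	/* likely OOM-killed before this */

			/* half incompressible, half compressible; use 1s,
			 * not 0s, to avoid the zero-page shortcut */
			if (fill_random(urandom, p, CHUNK / 2) < 0)
				return 1;
			memset(p + CHUNK / 2, 1, CHUNK - CHUNK / 2);
			/* the memory is never touched again */
		}
		sleep(1);	/* crude rate limit: rate_mb MB per second */
	}
}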

-- 
Kind regards,
Minchan Kim
