Re: swap-add-a-simple-detector-for-inappropriate-swapin-readahead.patch

Hugh Dickins <hughd@xxxxxxxxxx> · Fri, 31 Jan 2014 20:15:15 -0800 (PST)

On Fri, 31 Jan 2014, Andrew Morton wrote:
> 
> I've been sitting on this one since September 2012 (gad) and I should
> get it off the books.  I didn't merge it because I really dislike all
> the magic number heuristics in there.  Can anyone think of anything
> smarter?

I've sometimes noticed it in mmotm, and thought, oh good it's on its way.
And then a few months later noticed that it isn't in Linus's tree, and
wondered what had gone wrong.  I'm more conscious of it this time around,
caught it weeping quietly at home when all the rest had gone to the party,
and was intending to write you a note.

I'm not going to apologize for the magic numbers any more than the
comment already admits about the "+ 2".  I think the 1s and 2s and 4s
are more honest than #define SWAPIN_READAHEAD_HIT_BOOST 2 etc.

If someone devises something smarter, and comparably low overhead,
by all means kick this out; but it started out as my reaction to
Shaohua's more sophisticated but less effective vma-based swapin
readahead limiter, and we both think it does a pretty good job
for something so minimal.

But I'm more conscious of it this time around, because I find myself
relying on it for a totally different reason than it was written for.
After upgrading the userspace on one machine, I found my old tmpfs
swapping kbuild tests were now getting OOM-killed when before they
had plugged along forever; and you can rightly say that I just need
to retune them a little, and all would be well.

But although they were OOM-killed on the vanilla kernel, they proceeded
happily with mmotm.  (And although they were OOM-killed when swapping
to SSD with vanilla kernel, they proceeded happily though sluggishly
when swapping to HDD: that's a difference I've not yet plumbed.)

What made the difference was this swapin readahead limiter: when you
are heavily into page reclaim, reading 7 pages in to the head of the
inactive list for every 1 that you actively want to use, is rather
bad strategy, and we end up OOMing because the LRU keeps getting
clogged up with unwanted pages (I think: I'm pretending to have
investigated it in greater depth than I honestly have).

This patch adapts automatically to that, which the vanilla kernel
currently does not: it's magic.
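
To make that adaptation concrete, here is a rough userspace model of the
swapin_nr_pages() logic from the patch quoted below.  It is only a sketch:
struct ra_model and model_nr_pages() are names invented for illustration,
and the kernel's atomics and static variables are replaced by plain struct
fields, so it says nothing about concurrency.

```c
#include <assert.h>

/* Hypothetical model state; these names are illustrative, not kernel API. */
struct ra_model {
	unsigned int page_cluster;	/* log2 of the maximum window (default 3) */
	unsigned int hits;		/* readahead hits since the last call */
	unsigned long prev_offset;	/* last offset seen with no hits */
	unsigned int last_pages;	/* window returned by the previous call */
};

/* Mirrors the arithmetic of swapin_nr_pages() in the patch, minus atomics. */
static unsigned int model_nr_pages(struct ra_model *m, unsigned long offset)
{
	unsigned int pages, max_pages, last_ra;

	assert(m->page_cluster < 32);
	max_pages = 1u << m->page_cluster;
	if (max_pages <= 1)
		return 1;

	/* Consume the hit counter, as atomic_xchg(..., 0) + 2 does. */
	pages = m->hits + 2;
	m->hits = 0;
	if (pages == 2) {
		/* No hits: keep a 2-page window only for adjacent offsets. */
		if (offset != m->prev_offset + 1 && offset != m->prev_offset - 1)
			pages = 1;
		m->prev_offset = offset;
	} else {
		/* Some hits: round up to a power of two, at least 4. */
		unsigned int roundup = 4;
		while (roundup < pages)
			roundup <<= 1;
		pages = roundup;
	}

	if (pages > max_pages)
		pages = max_pages;

	/* Don't shrink below half of the previous window. */
	last_ra = m->last_pages / 2;
	if (pages < last_ra)
		pages = last_ra;
	m->last_pages = pages;

	return pages;
}
```

Starting from the patch's initial hit count of 4 with page_cluster 3, a run
of non-adjacent offsets with no hits gives windows of 8, 4, 2, 1 in this
model; a single hit after that brings the next window back up to 4.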

Hugh

> 
> 
> 
> From: Shaohua Li <shli@xxxxxxxxxx>
> Subject: swap: add a simple detector for inappropriate swapin readahead
> 
> This is a patch to improve the swap readahead algorithm. It's from Hugh and
> I slightly changed it.
> 
> Hugh's original changelog:
> 
> swapin readahead does a blind readahead, whether or not the swapin
> is sequential.  This may be ok on harddisk, because large reads have
> relatively small costs, and if the readahead pages are unneeded they
> can be reclaimed easily - though, what if their allocation forced
> reclaim of useful pages?  But on SSD devices large reads are more
> expensive than small ones: if the readahead pages are unneeded,
> reading them in caused significant overhead.
> 
> This patch adds very simplistic random read detection.  Stealing
> the PageReadahead technique from Konstantin Khlebnikov's patch,
> avoiding the vma/anon_vma sophistications of Shaohua Li's patch,
> swapin_nr_pages() simply looks at readahead's current success
> rate, and narrows or widens its readahead window accordingly.
> There is little science to its heuristic: it's about as stupid
> as can be whilst remaining effective.
> 
> The table below shows elapsed times (in centiseconds) when running
> a single repetitive swapping load across a 1000MB mapping in 900MB
> ram with 1GB swap (the harddisk tests had taken painfully too long
> when I used mem=500M, but SSD shows similar results for that).
> 
> Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes
> his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1
> patch which Shaohua showed to be defective; HughNew this Nov 14
> patch, with page_cluster as usual at default of 3 (8-page reads);
> HughPC4 this same patch with page_cluster 4 (16-page reads);
> HughPC0 with page_cluster 0 (1-page reads: no readahead).
> 
> HDD for swapping to harddisk, SSD for swapping to VertexII SSD.
> Seq for sequential access to the mapping, cycling five times around;
> Rand for the same number of random touches.  Anon for a MAP_PRIVATE
> anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
> 
> One weakness of Shaohua's vma/anon_vma approach was that it did
> not optimize Shmem: seen below.  Konstantin's approach was perhaps
> mistuned, 50% slower on Seq: did not compete and is not shown below.
> 
> HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
> Seq Anon     73921   76210   75611   76904   78191  121542
> Seq Shmem    73601   73176   73855   72947   74543  118322
> Rand Anon   895392  831243  871569  845197  846496  841680
> Rand Shmem 1058375 1053486  827935  764955  764376  756489
> 
> SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
> Seq Anon     24634   24198   24673   25107   21614   70018
> Seq Shmem    24959   24932   25052   25703   22030   69678
> Rand Anon    43014   26146   28075   25989   26935   25901
> Rand Shmem   45349   45215   28249   24268   24138   24332
> 
> These tests are, of course, two extremes of a very simple case:
> under heavier mixed loads I've not yet observed any consistent
> improvement or degradation, and wider testing would be welcome.
> 
> Shaohua Li:
> 
> Testing shows Vanilla is slightly better than Hugh's patch on a sequential
> workload.  With Hugh's patch I observed that the readahead size sometimes
> shrinks too fast (from 8 to 1 immediately) in a sequential workload if there
> is no hit, and in that case continuing to do readahead is actually good.
> 
> I haven't prepared a sophisticated algorithm for the sequential workload,
> because so far we can't guarantee that sequentially accessed pages are swapped
> out sequentially.  So I slightly changed Hugh's heuristic: don't shrink the
> readahead size too fast.
> 
> Here are my test results (in seconds, average of 3 runs):
> 	Vanilla		Hugh		New
> Seq	356		370		360
> Random	4525		2447		2444
> 
> The attached graph shows the swapin/swapout throughput I collected with
> 'vmstat 2'.  The first part is a random workload (up to around 1200 on the
> x-axis) and the second part is a sequential workload.  Swapin and swapout
> throughput are almost identical in steady state in both workloads, which is
> the expected behavior; whereas in Vanilla, swapin is much bigger than swapout,
> especially in the random workload (because of wrong readahead).
> 
> Original patches by: Shaohua Li and Konstantin Khlebnikov.
> 
> [fengguang.wu@xxxxxxxxx: swapin_nr_pages() can be static]
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Signed-off-by: Shaohua Li <shli@xxxxxxxxxxxx>
> Signed-off-by: Fengguang Wu <fengguang.wu@xxxxxxxxx>
> Cc: Rik van Riel <riel@xxxxxxxxxx>
> Cc: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> Cc: Minchan Kim <minchan@xxxxxxxxxx>
> Cc: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
> 
>  include/linux/page-flags.h |    4 +-
>  mm/swap_state.c            |   63 +++++++++++++++++++++++++++++++++--
>  2 files changed, 62 insertions(+), 5 deletions(-)
> 
> diff -puN include/linux/page-flags.h~swap-add-a-simple-detector-for-inappropriate-swapin-readahead include/linux/page-flags.h
> --- a/include/linux/page-flags.h~swap-add-a-simple-detector-for-inappropriate-swapin-readahead
> +++ a/include/linux/page-flags.h
> @@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
>  TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
>  PAGEFLAG(MappedToDisk, mappedtodisk)
>  
> -/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
> +/* PG_readahead is only used for reads; PG_reclaim is only for writes */
>  PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
> -PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
> +PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
>  
>  #ifdef CONFIG_HIGHMEM
>  /*
> diff -puN mm/swap_state.c~swap-add-a-simple-detector-for-inappropriate-swapin-readahead mm/swap_state.c
> --- a/mm/swap_state.c~swap-add-a-simple-detector-for-inappropriate-swapin-readahead
> +++ a/mm/swap_state.c
> @@ -63,6 +63,8 @@ unsigned long total_swapcache_pages(void
>  	return ret;
>  }
>  
> +static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
> +
>  void show_swap_cache_info(void)
>  {
>  	printk("%lu pages in swap cache\n", total_swapcache_pages());
> @@ -286,8 +288,11 @@ struct page * lookup_swap_cache(swp_entr
>  
>  	page = find_get_page(swap_address_space(entry), entry.val);
>  
> -	if (page)
> +	if (page) {
>  		INC_CACHE_INFO(find_success);
> +		if (TestClearPageReadahead(page))
> +			atomic_inc(&swapin_readahead_hits);
> +	}
>  
>  	INC_CACHE_INFO(find_total);
>  	return page;
> @@ -389,6 +394,50 @@ struct page *read_swap_cache_async(swp_e
>  	return found_page;
>  }
>  
> +static unsigned long swapin_nr_pages(unsigned long offset)
> +{
> +	static unsigned long prev_offset;
> +	unsigned int pages, max_pages, last_ra;
> +	static atomic_t last_readahead_pages;
> +
> +	max_pages = 1 << ACCESS_ONCE(page_cluster);
> +	if (max_pages <= 1)
> +		return 1;
> +
> +	/*
> +	 * This heuristic has been found to work well on both sequential and
> +	 * random loads, swapping to hard disk or to SSD: please don't ask
> +	 * what the "+ 2" means, it just happens to work well, that's all.
> +	 */
> +	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
> +	if (pages == 2) {
> +		/*
> +		 * We can have no readahead hits to judge by: but must not get
> +		 * stuck here forever, so check for an adjacent offset instead
> +		 * (and don't even bother to check whether swap type is same).
> +		 */
> +		if (offset != prev_offset + 1 && offset != prev_offset - 1)
> +			pages = 1;
> +		prev_offset = offset;
> +	} else {
> +		unsigned int roundup = 4;
> +		while (roundup < pages)
> +			roundup <<= 1;
> +		pages = roundup;
> +	}
> +
> +	if (pages > max_pages)
> +		pages = max_pages;
> +
> +	/* Don't shrink readahead too fast */
> +	last_ra = atomic_read(&last_readahead_pages) / 2;
> +	if (pages < last_ra)
> +		pages = last_ra;
> +	atomic_set(&last_readahead_pages, pages);
> +
> +	return pages;
> +}
> +
>  /**
>   * swapin_readahead - swap in pages in hope we need them soon
>   * @entry: swap entry of this memory
> @@ -412,11 +461,16 @@ struct page *swapin_readahead(swp_entry_
>  			struct vm_area_struct *vma, unsigned long addr)
>  {
>  	struct page *page;
> -	unsigned long offset = swp_offset(entry);
> +	unsigned long entry_offset = swp_offset(entry);
> +	unsigned long offset = entry_offset;
>  	unsigned long start_offset, end_offset;
> -	unsigned long mask = (1UL << page_cluster) - 1;
> +	unsigned long mask;
>  	struct blk_plug plug;
>  
> +	mask = swapin_nr_pages(offset) - 1;
> +	if (!mask)
> +		goto skip;
> +
>  	/* Read a page_cluster sized and aligned cluster around offset. */
>  	start_offset = offset & ~mask;
>  	end_offset = offset | mask;
> @@ -430,10 +484,13 @@ struct page *swapin_readahead(swp_entry_
>  						gfp_mask, vma, addr);
>  		if (!page)
>  			continue;
> +		if (offset != entry_offset)
> +			SetPageReadahead(page);
>  		page_cache_release(page);
>  	}
>  	blk_finish_plug(&plug);
>  
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
> +skip:
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }
> _
> 
> 
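
For anyone reading along, the window that swapin_readahead() then reads is
a power-of-two sized, aligned block around the faulting offset.  A minimal
sketch of just that mask arithmetic follows; struct ra_range and ra_window()
are made-up names, and it ignores any other checks the real function applies
in the lines the diff context does not show.

```c
#include <assert.h>

/* Hypothetical helper type: inclusive range of swap offsets to read. */
struct ra_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Given the faulting swap offset and a power-of-two window size (as
 * returned by swapin_nr_pages() in the patch), compute the aligned
 * cluster the same way the patch does: mask = nr_pages - 1.
 */
static struct ra_range ra_window(unsigned long offset, unsigned long nr_pages)
{
	struct ra_range r;
	unsigned long mask;

	/* The mask trick only works for a non-zero power of two. */
	assert(nr_pages != 0 && (nr_pages & (nr_pages - 1)) == 0);

	mask = nr_pages - 1;
	r.start = offset & ~mask;	/* round down to the cluster start */
	r.end = offset | mask;		/* last offset in the same cluster */
	return r;
}
```

With the default 8-page window, offset 21 gives the range 16..23, so seven
of the eight pages are the readahead pages that get SetPageReadahead(); with
a 4-page window the range is 20..23, and with a 1-page window it is just 21
(the case where the patch jumps straight to skip).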

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .