- mm-implement-swap-prefetching.patch removed from -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Tue, 28 Aug 2007 01:13:37 -0700

The patch titled
     mm: Implement swap prefetching
has been removed from the -mm tree.  Its filename was
     mm-implement-swap-prefetching.patch

This patch was dropped because it was withdrawn

------------------------------------------------------
Subject: mm: Implement swap prefetching
From: Con Kolivas <kernel@xxxxxxxxxxx>

Implement swap prefetching when the vm is relatively idle and there is free
ram available.  The code is based on some preliminary code by Thomas
Schlichter.

This stores a list of swapped entries in a list ordered most recently used
and a radix tree.  It generates a low priority kernel thread running at
nice 19 to do the prefetching at a later stage.

Once pages have been added to the swapped list, a timer is started, testing
for conditions suitable to prefetch swap pages every 5 seconds.  Suitable
conditions are defined as lack of swapping out or in any pages, and no
watermark tests failing.  Significant amounts of dirtied ram and changes in
free ram representing disk writes or reads also prevent prefetching.

It then checks that we have spare ram looking for at least 3* pages_high
free per zone and if it succeeds that will prefetch pages from swap into
the swap cache.  The pages are added to the tail of the inactive list to
preserve LRU ordering.

Pages are prefetched until the list is empty or the vm is seen as busy
according to the previously described criteria.  Node data on numa is
stored with the entries and an appropriate zonelist based on this is used
when allocating ram.

The pages are copied to swap cache and kept on backing store.  This allows
pressure on either physical ram or swap to readily find free pages without
further I/O.

Prefetching can be enabled/disabled via the tunable in
/proc/sys/vm/swap_prefetch initially set to 1 (enabled).

Enabling laptop_mode disables swap prefetching to prevent unnecessary spin
ups.

In testing on modern pc hardware this results in wall-clock time activation
of the firefox browser to speed up 5 fold after a worst case complete
swap-out of the browser on a static web page.


  Fix potential swap-prefetch deadlock, found by the locking correctness
  validator.

[clameter@xxxxxxx: ZVC writeback: Fix mm and other issues]
[clameter@xxxxxxx: ZVC writeback: Fix mm and other issues]
[akpm@xxxxxxxx: Don't add new sysctl numbers]
Signed-off-by: Con Kolivas <kernel@xxxxxxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
Signed-off-by: Christoph Lameter <clameter@xxxxxxx>
DESC
make mm/swap_prefetch.c:remove_from_swapped_list() static
EDESC

Signed-off-by: Adrian Bunk <bunk@xxxxxxxxx>
Acked-by: Con Kolivas <kernel@xxxxxxxxxxx>
DESC
swap prefetch: avoid repeating entry
EDESC

Avoid entering trickle_swap() when first initialising kprefetchd to prevent
endless loops.

Signed-off-by: Con Kolivas <kernel@xxxxxxxxxxx>
DESC
mm: swap prefetch improvements
EDESC

Numerous improvements to swap prefetch.

It was possible for kprefetchd to go to sleep indefinitely before/after
changing the /proc value of swap prefetch. Fix that.

The cost of remove_from_swapped_list() can be removed from every page swapin
by moving it to be done entirely by kprefetchd lazily.

The call site for add_to_swapped_list need only be at one place.

Wakeups can occur much less frequently if swap prefetch is disabled.

Make it possible to enable swap prefetch explicitly via /proc when laptop_mode
is enabled by changing the value of the sysctl to 2.

The complicated iteration over every entry can be consolidated by using
list_for_each_safe.

Fix potential irq problem by converting read_lock_irq to irqsave etc.

Code style fixes.

Change the ioprio from IOPRIO_CLASS_IDLE to normal lower priority to ensure
that bio requests are not starved if other I/O begins during prefetching.

Signed-off-by: Con Kolivas <kernel@xxxxxxxxxxx>
DESC
mm: swap prefetch more improvements
EDESC

Failed radix_tree_insert wasn't being handled leaving stale kmem.

The list should be iterated over in the reverse order when prefetching.

Make the yield within kprefetchd stronger through the use of cond_resched.

Check that the pos entry hasn't been removed while unlocked.

Signed-off-by: Con Kolivas <kernel@xxxxxxxxxxx>
DESC
mm: swap prefetch increase aggressiveness and tunability
EDESC

Swap prefetch is currently too lax in prefetching with extended idle
periods unused.  Increase its aggressiveness and tunability.

Make it possible for swap_prefetch to be set to a high value ignoring load
and prefetching regardless.

Add tunables to modify the swap prefetch delay and sleep period on the fly,
and decrease both periods to 1 and 5 seconds respectively.  Extended
periods did not decrease the impact any further but greatly diminished the
rate ram was prefetched.

Remove the prefetch_watermark that left free ram unused.  The impact of
using the free ram with prefetched pages being put on the tail end of the
inactive list would be minimal and potentially very beneficial, yet testing
the pagestate adds unnecessary expense.

Put kprefetchd to sleep if the low watermarks are hit instead of delaying it.

Increase the maxcount as the lazy removal of swapped entries means we can
easily have many stale entries and not enough entries for good swap
prefetch.

Do not delay prefetch in cond_resched() returning positive.  That was
pointless and frequently put kprefetchd to sleep for no reason.

Update comments and documentation.

Signed-off-by: Con Kolivas <kernel@xxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 Documentation/sysctl/vm.txt   |   37 ++
 include/linux/mm_inline.h     |    7 
 include/linux/swap-prefetch.h |   53 +++
 include/linux/swap.h          |    2 
 init/Kconfig                  |   22 +
 kernel/sysctl.c               |   27 +
 mm/Makefile                   |    1 
 mm/page_io.c                  |    2 
 mm/swap.c                     |   46 ++
 mm/swap_prefetch.c            |  542 ++++++++++++++++++++++++++++++++
 mm/swap_state.c               |    9 
 mm/vmscan.c                   |    4 
 12 files changed, 751 insertions(+), 1 deletion(-)

diff -puN Documentation/sysctl/vm.txt~mm-implement-swap-prefetching Documentation/sysctl/vm.txt

--- a/Documentation/sysctl/vm.txt~mm-implement-swap-prefetching
+++ a/Documentation/sysctl/vm.txt
@@ -33,6 +33,9 @@ Currently, these files are in /proc/sys/
 - panic_on_oom
 - mmap_min_address
 - numa_zonelist_order
+- swap_prefetch
+- swap_prefetch_delay
+- swap_prefetch_sleep
 
 ==============================================================
 
@@ -277,3 +280,37 @@ will select "node" order in following ca
 
 Otherwise, "zone" order will be selected. Default order is recommended unless
 this is causing problems for your system/application.
+
+==============================================================
+
+swap_prefetch
+
+This enables or disables the swap prefetching feature. When the virtual
+memory subsystem has been extremely idle for at least swap_prefetch_sleep
+seconds it will start copying back pages from swap into the swapcache and keep
+a copy in swap. Valid values are 0 - 3. A value of 0 disables swap
+prefetching, 1 enables it unless laptop_mode is enabled, 2 enables it in the
+presence of laptop_mode, and 3 enables it unconditionally, ignoring whether
+the system is idle or not. If set to 0, swap prefetch wil not even try to keep
+record of ram swapped out to have the most minimal impact on performance.
+
+The default value is 1.
+
+==============================================================
+
+swap_prefetch_delay
+
+This is the time in seconds that swap prefetching is delayed upon finding
+the system is not idle (ie the vm is busy or non-niced cpu load is present).
+
+The default value is 1.
+
+==============================================================
+
+swap_prefetch_sleep
+
+This is the time in seconds that the swap prefetch kernel thread is put to
+sleep for when the ram is found to be full and it is unable to prefetch
+further.
+
+The default value is 5.
diff -puN include/linux/mm_inline.h~mm-implement-swap-prefetching include/linux/mm_inline.h
--- a/include/linux/mm_inline.h~mm-implement-swap-prefetching
+++ a/include/linux/mm_inline.h
@@ -13,6 +13,13 @@ add_page_to_inactive_list(struct zone *z
 }
 
 static inline void
+add_page_to_inactive_list_tail(struct zone *zone, struct page *page)
+{
+	list_add_tail(&page->lru, &zone->inactive_list);
+	__inc_zone_state(zone, NR_INACTIVE);
+}
+
+static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
diff -puN /dev/null include/linux/swap-prefetch.h
--- /dev/null
+++ a/include/linux/swap-prefetch.h
@@ -0,0 +1,53 @@
+#ifndef SWAP_PREFETCH_H_INCLUDED
+#define SWAP_PREFETCH_H_INCLUDED
+
+#ifdef CONFIG_SWAP_PREFETCH
+/* mm/swap_prefetch.c */
+extern int swap_prefetch;
+extern int swap_prefetch_delay;
+extern int swap_prefetch_sleep;
+
+struct swapped_entry {
+	swp_entry_t		swp_entry;	/* The actual swap entry */
+	struct list_head	swapped_list;	/* Linked list of entries */
+#if MAX_NUMNODES > 1
+	int			node;		/* Node id */
+#endif
+} __attribute__((packed));
+
+static inline void store_swap_entry_node(struct swapped_entry *entry,
+	struct page *page)
+{
+#if MAX_NUMNODES > 1
+	entry->node = page_to_nid(page);
+#endif
+}
+
+static inline int get_swap_entry_node(struct swapped_entry *entry)
+{
+#if MAX_NUMNODES > 1
+	return entry->node;
+#else
+	return 0;
+#endif
+}
+
+extern void add_to_swapped_list(struct page *page);
+extern void delay_swap_prefetch(void);
+extern void prepare_swap_prefetch(void);
+
+#else	/* CONFIG_SWAP_PREFETCH */
+static inline void add_to_swapped_list(struct page *__unused)
+{
+}
+
+static inline void prepare_swap_prefetch(void)
+{
+}
+
+static inline void delay_swap_prefetch(void)
+{
+}
+#endif	/* CONFIG_SWAP_PREFETCH */
+
+#endif		/* SWAP_PREFETCH_H_INCLUDED */
diff -puN include/linux/swap.h~mm-implement-swap-prefetching include/linux/swap.h
--- a/include/linux/swap.h~mm-implement-swap-prefetching
+++ a/include/linux/swap.h
@@ -180,6 +180,7 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
+extern void FASTCALL(lru_cache_add_tail(struct page *));
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
@@ -238,6 +239,7 @@ extern void free_pages_and_swap_cache(st
 extern struct page * lookup_swap_cache(swp_entry_t);
 extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma,
 					   unsigned long addr);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry);
 /* linux/mm/swapfile.c */
 extern long total_swap_pages;
 extern unsigned int nr_swapfiles;
diff -puN init/Kconfig~mm-implement-swap-prefetching init/Kconfig
--- a/init/Kconfig~mm-implement-swap-prefetching
+++ a/init/Kconfig
@@ -102,6 +102,28 @@ config SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config SWAP_PREFETCH
+	bool "Support for prefetching swapped memory"
+	depends on SWAP
+	default y
+	---help---
+	  This option will allow the kernel to prefetch swapped memory pages
+	  when idle. The pages will be kept on both swap and in swap_cache
+	  thus avoiding the need for further I/O if either ram or swap space
+	  is required.
+
+	  What this will do on workstations is slowly bring back applications
+	  that have swapped out after memory intensive workloads back into
+	  physical ram if you have free ram at a later stage and the machine
+	  is relatively idle. This means that when you come back to your
+	  computer after leaving it idle for a while, applications will come
+	  to life faster. Note that your swap usage will appear to increase
+	  but these are cached pages, can be dropped freely by the vm, and it
+	  should stabilise around 50% swap usage maximum.
+
+	  Workstations and multiuser workstation servers will most likely want
+	  to say Y.
+
 config SYSVIPC
 	bool "System V IPC"
 	---help---
diff -puN kernel/sysctl.c~mm-implement-swap-prefetching kernel/sysctl.c
--- a/kernel/sysctl.c~mm-implement-swap-prefetching
+++ a/kernel/sysctl.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
+#include <linux/swap-prefetch.h>
 #include <linux/sysctl.h>
 #include <linux/proc_fs.h>
 #include <linux/capability.h>
@@ -1021,6 +1022,32 @@ static ctl_table vm_table[] = {
 		.extra2		= &one_hundred,
 	},
 #endif
+#ifdef CONFIG_SWAP_PREFETCH
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "swap_prefetch",
+		.data		= &swap_prefetch,
+		.maxlen		= sizeof(swap_prefetch),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "swap_prefetch_delay",
+		.data		= &swap_prefetch_delay,
+		.maxlen		= sizeof(swap_prefetch_delay),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "swap_prefetch_sleep",
+		.data		= &swap_prefetch_sleep,
+		.maxlen		= sizeof(swap_prefetch_sleep),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 #ifdef CONFIG_SMP
 	{
 		.ctl_name	= CTL_UNNUMBERED,
diff -puN mm/Makefile~mm-implement-swap-prefetching mm/Makefile
--- a/mm/Makefile~mm-implement-swap-prefetching
+++ a/mm/Makefile
@@ -15,6 +15,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 
 obj-$(CONFIG_BOUNCE)	+= bounce.o
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_SWAP_PREFETCH) += swap_prefetch.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
diff -puN mm/page_io.c~mm-implement-swap-prefetching mm/page_io.c
--- a/mm/page_io.c~mm-implement-swap-prefetching
+++ a/mm/page_io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/swap-prefetch.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -118,6 +119,7 @@ int swap_writepage(struct page *page, st
 		ret = -ENOMEM;
 		goto out;
 	}
+	add_to_swapped_list(page);
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
diff -puN mm/swap.c~mm-implement-swap-prefetching mm/swap.c
--- a/mm/swap.c~mm-implement-swap-prefetching
+++ a/mm/swap.c
@@ -17,6 +17,7 @@
 #include <linux/sched.h>
 #include <linux/kernel_stat.h>
 #include <linux/swap.h>
+#include <linux/swap-prefetch.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/pagevec.h>
@@ -174,6 +175,7 @@ EXPORT_SYMBOL(mark_page_accessed);
  */
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };
 static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_add_tail_pvecs) = { 0, };
 
 void fastcall lru_cache_add(struct page *page)
 {
@@ -195,6 +197,31 @@ void fastcall lru_cache_add_active(struc
 	put_cpu_var(lru_add_active_pvecs);
 }
 
+static void __pagevec_lru_add_tail(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		add_page_to_inactive_list_tail(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
 static void __lru_add_drain(int cpu)
 {
 	struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu);
@@ -205,6 +232,9 @@ static void __lru_add_drain(int cpu)
 	pvec = &per_cpu(lru_add_active_pvecs, cpu);
 	if (pagevec_count(pvec))
 		__pagevec_lru_add_active(pvec);
+	pvec = &per_cpu(lru_add_tail_pvecs, cpu);
+	if (pagevec_count(pvec))
+		__pagevec_lru_add_tail(pvec);
 }
 
 void lru_add_drain(void)
@@ -401,6 +431,21 @@ void __pagevec_lru_add_active(struct pag
 }
 
 /*
+ * Function used uniquely to put pages back to the lru at the end of the
+ * inactive list to preserve the lru order. Currently only used by swap
+ * prefetch.
+ */
+void fastcall lru_cache_add_tail(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_tail_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_tail(pvec);
+	put_cpu_var(lru_add_pvecs);
+}
+
+/*
  * Try to drop buffers from the pages in a pagevec
  */
 void pagevec_strip(struct pagevec *pvec)
@@ -512,6 +557,7 @@ void __init swap_setup(void)
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
+	prepare_swap_prefetch();
 #ifdef CONFIG_HOTPLUG_CPU
 	hotcpu_notifier(cpu_swap_callback, 0);
 #endif
diff -puN /dev/null mm/swap_prefetch.c
--- /dev/null
+++ a/mm/swap_prefetch.c
@@ -0,0 +1,542 @@
+/*
+ * linux/mm/swap_prefetch.c
+ *
+ * Copyright (C) 2005-2007 Con Kolivas
+ *
+ * Written by Con Kolivas <kernel@xxxxxxxxxxx>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/swap-prefetch.h>
+#include <linux/ioprio.h>
+#include <linux/kthread.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/writeback.h>
+#include <linux/vmstat.h>
+#include <linux/freezer.h>
+
+/*
+ * sysctls:
+ * swap_prefetch:	0. Disable swap prefetching
+ *			1. Prefetch only when idle and not with laptop_mode
+ *			2. Prefetch when idle and with laptop_mode
+ *			3. Prefetch at all times.
+ * swap_prefetch_delay:	Number of seconds to delay prefetching when system
+ *			is not idle.
+ * swap_prefetch_sleep:	Number of seconds to put kprefetchd to sleep when
+ *			unable to prefetch.
+ */
+int swap_prefetch __read_mostly = 1;
+int swap_prefetch_delay __read_mostly = 1;
+int swap_prefetch_sleep __read_mostly = 5;
+
+#define PREFETCH_DELAY		(HZ * swap_prefetch_delay)
+#define PREFETCH_SLEEP		((HZ * swap_prefetch_sleep) ? : 1)
+
+struct swapped_root {
+	unsigned long		busy;		/* vm busy */
+	spinlock_t		lock;		/* protects all data */
+	struct list_head	list;		/* MRU list of swapped pages */
+	struct radix_tree_root	swap_tree;	/* Lookup tree of pages */
+	unsigned int		count;		/* Number of entries */
+	unsigned int		maxcount;	/* Maximum entries allowed */
+	struct kmem_cache	*cache;		/* Of struct swapped_entry */
+};
+
+static struct swapped_root swapped = {
+	.lock		= SPIN_LOCK_UNLOCKED,
+	.list  		= LIST_HEAD_INIT(swapped.list),
+	.swap_tree	= RADIX_TREE_INIT(GFP_ATOMIC),
+};
+
+static struct task_struct *kprefetchd_task;
+
+/*
+ * We check to see no part of the vm is busy. If it is this will interrupt
+ * trickle_swap and wait another PREFETCH_DELAY. Purposefully racy.
+ */
+inline void delay_swap_prefetch(void)
+{
+	if (!test_bit(0, &swapped.busy))
+		__set_bit(0, &swapped.busy);
+}
+
+/*
+ * If laptop_mode is enabled don't prefetch to avoid hard drives
+ * doing unnecessary spin-ups unless swap_prefetch is explicitly
+ * set to a higher value.
+ */
+static inline int prefetch_enabled(void)
+{
+	if (swap_prefetch <= laptop_mode)
+		return 0;
+	return 1;
+}
+
+static int kprefetchd_awake;
+
+/*
+ * Drop behind accounting which keeps a list of the most recently used swap
+ * entries. Entries are removed lazily by kprefetchd.
+ */
+void add_to_swapped_list(struct page *page)
+{
+	struct swapped_entry *entry;
+	unsigned long index, flags;
+
+	if (!prefetch_enabled())
+		goto out;
+
+	spin_lock_irqsave(&swapped.lock, flags);
+	if (swapped.count >= swapped.maxcount) {
+		/*
+		 * Once the number of entries exceeds maxcount we start
+		 * removing the least recently used entries.
+		 */
+		entry = list_entry(swapped.list.next,
+			struct swapped_entry, swapped_list);
+		radix_tree_delete(&swapped.swap_tree, entry->swp_entry.val);
+		list_del(&entry->swapped_list);
+		swapped.count--;
+	} else {
+		entry = kmem_cache_alloc(swapped.cache, GFP_ATOMIC);
+		if (unlikely(!entry))
+			/* bad, can't allocate more mem */
+			goto out_locked;
+	}
+
+	index = page_private(page);
+	entry->swp_entry.val = index;
+	/*
+	 * On numa we need to store the node id to ensure that we prefetch to
+	 * the same node it came from.
+	 */
+	store_swap_entry_node(entry, page);
+
+	if (likely(!radix_tree_insert(&swapped.swap_tree, index, entry))) {
+		list_add(&entry->swapped_list, &swapped.list);
+		swapped.count++;
+	} else
+		kmem_cache_free(swapped.cache, entry);
+
+out_locked:
+	spin_unlock_irqrestore(&swapped.lock, flags);
+out:
+	if (!kprefetchd_awake)
+		wake_up_process(kprefetchd_task);
+	return;
+}
+
+/*
+ * Removes entries from the swapped_list. The radix tree allows us to quickly
+ * look up the entry from the index without having to iterate over the whole
+ * list.
+ */
+static void remove_from_swapped_list(const unsigned long index)
+{
+	struct swapped_entry *entry;
+	unsigned long flags;
+
+	spin_lock_irqsave(&swapped.lock, flags);
+	entry = radix_tree_delete(&swapped.swap_tree, index);
+	if (likely(entry)) {
+		list_del(&entry->swapped_list);
+		swapped.count--;
+		kmem_cache_free(swapped.cache, entry);
+	}
+	spin_unlock_irqrestore(&swapped.lock, flags);
+}
+
+enum trickle_return {
+	TRICKLE_SUCCESS,
+	TRICKLE_FAILED,
+	TRICKLE_DELAY,
+};
+
+struct node_stats {
+	/* Free ram after a cycle of prefetching */
+	unsigned long	last_free;
+	/* Free ram on this cycle of checking prefetch_suitable */
+	unsigned long	current_free;
+	/* The amount of free ram before we start prefetching */
+	unsigned long	highfree[MAX_NR_ZONES];
+	/* The amount of free ram where we will stop prefetching */
+	unsigned long	lowfree[MAX_NR_ZONES];
+	/* highfree or lowfree depending on whether we've hit a watermark */
+	unsigned long	*pointfree[MAX_NR_ZONES];
+};
+
+/*
+ * prefetch_stats stores the free ram data of each node and this is used to
+ * determine if a node is suitable for prefetching into.
+ */
+struct prefetch_stats {
+	/* Which nodes are currently suited to prefetching */
+	nodemask_t	prefetch_nodes;
+	/* Total pages we've prefetched on this wakeup of kprefetchd */
+	unsigned long	prefetched_pages;
+	struct node_stats node[MAX_NUMNODES];
+};
+
+static struct prefetch_stats sp_stat;
+
+/*
+ * This tries to read a swp_entry_t into swap cache for swap prefetching.
+ * If it returns TRICKLE_DELAY we should delay further prefetching.
+ */
+static enum trickle_return trickle_swap_cache_async(const swp_entry_t entry,
+	const int node)
+{
+	enum trickle_return ret = TRICKLE_FAILED;
+	unsigned long flags;
+	struct page *page;
+
+	read_lock_irqsave(&swapper_space.tree_lock, flags);
+	/* Entry may already exist */
+	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
+	read_unlock_irqrestore(&swapper_space.tree_lock, flags);
+	if (page)
+		goto out;
+
+	/*
+	 * Get a new page to read from swap. We have already checked the
+	 * watermarks so __alloc_pages will not call on reclaim.
+	 */
+	page = alloc_pages_node(node, GFP_HIGHUSER & ~__GFP_WAIT, 0);
+	if (unlikely(!page)) {
+		ret = TRICKLE_DELAY;
+		goto out;
+	}
+
+	if (add_to_swap_cache(page, entry)) {
+		/* Failed to add to swap cache */
+		goto out_release;
+	}
+
+	/* Add them to the tail of the inactive list to preserve LRU order */
+	lru_cache_add_tail(page);
+	if (unlikely(swap_readpage(NULL, page)))
+		goto out_release;
+
+	sp_stat.prefetched_pages++;
+	sp_stat.node[node].last_free--;
+
+	ret = TRICKLE_SUCCESS;
+out_release:
+	page_cache_release(page);
+out:
+	/*
+	 * All entries are removed here lazily. This avoids the cost of
+	 * remove_from_swapped_list during normal swapin. Thus there are
+	 * usually many stale entries.
+	 */
+	remove_from_swapped_list(entry.val);
+	return ret;
+}
+
+static void clear_last_prefetch_free(void)
+{
+	int node;
+
+	/*
+	 * Reset the nodes suitable for prefetching to all nodes. We could
+	 * update the data to take into account memory hotplug if desired..
+	 */
+	sp_stat.prefetch_nodes = node_online_map;
+	for_each_node_mask(node, sp_stat.prefetch_nodes) {
+		struct node_stats *ns = &sp_stat.node[node];
+
+		ns->last_free = 0;
+	}
+}
+
+static void clear_current_prefetch_free(void)
+{
+	int node;
+
+	sp_stat.prefetch_nodes = node_online_map;
+	for_each_node_mask(node, sp_stat.prefetch_nodes) {
+		struct node_stats *ns = &sp_stat.node[node];
+
+		ns->current_free = 0;
+	}
+}
+
+/*
+ * This updates the high and low watermarks of amount of free ram in each
+ * node used to start and stop prefetching. We prefetch from pages_high * 4
+ * down to pages_high * 3.
+ */
+static void examine_free_limits(void)
+{
+	struct zone *z;
+
+	for_each_zone(z) {
+		struct node_stats *ns;
+		int idx;
+
+		if (!populated_zone(z))
+			continue;
+
+		ns = &sp_stat.node[zone_to_nid(z)];
+		idx = zone_idx(z);
+		ns->lowfree[idx] = z->pages_high * 3;
+		ns->highfree[idx] = ns->lowfree[idx] + z->pages_high;
+
+		if (zone_page_state(z, NR_FREE_PAGES) > ns->highfree[idx]) {
+			/*
+			 * We've gotten above the high watermark of free pages
+			 * so we can start prefetching till we get to the low
+			 * watermark.
+			 */
+			ns->pointfree[idx] = &ns->lowfree[idx];
+		}
+	}
+}
+
+/*
+ * We want to be absolutely certain it's ok to start prefetching.
+ */
+static enum trickle_return prefetch_suitable(void)
+{
+	enum trickle_return ret = TRICKLE_DELAY;
+	struct zone *z;
+	int node;
+
+	/*
+	 * If swap_prefetch is set to a high value we can ignore load
+	 * and prefetch whenever we can. Otherwise we test for vm and
+	 * cpu activity.
+	 */
+	if (swap_prefetch < 3) {
+		/* Purposefully racy, may return false positive */
+		if (test_bit(0, &swapped.busy)) {
+			__clear_bit(0, &swapped.busy);
+			goto out;
+		}
+
+		/*
+		 * above_background_load is expensive so we only perform it
+		 * every SWAP_CLUSTER_MAX prefetched_pages.
+		 * We test to see if we're above_background_load as disk
+		 * activity even at low priority can cause interrupt induced
+		 * scheduling latencies.
+		 */
+		if (!(sp_stat.prefetched_pages % SWAP_CLUSTER_MAX) &&
+		    above_background_load())
+			goto out;
+	}
+	clear_current_prefetch_free();
+
+	/*
+	 * Have some hysteresis between where page reclaiming and prefetching
+	 * will occur to prevent ping-ponging between them.
+	 */
+	for_each_zone(z) {
+		struct node_stats *ns;
+		unsigned long free;
+		int idx;
+
+		if (!populated_zone(z))
+			continue;
+
+		node = zone_to_nid(z);
+		ns = &sp_stat.node[node];
+		idx = zone_idx(z);
+
+		free = zone_page_state(z, NR_FREE_PAGES);
+		if (free < *ns->pointfree[idx]) {
+			/*
+			 * Free pages have dropped below the low watermark so
+			 * we won't start prefetching again till we hit the
+			 * high watermark of free pages.
+			 */
+			ns->pointfree[idx] = &ns->highfree[idx];
+			node_clear(node, sp_stat.prefetch_nodes);
+			continue;
+		}
+		ns->current_free += free;
+	}
+
+	/*
+	 * We iterate over each node testing to see if it is suitable for
+	 * prefetching and clear the nodemask if it is not.
+	 */
+	for_each_node_mask(node, sp_stat.prefetch_nodes) {
+		struct node_stats *ns = &sp_stat.node[node];
+
+		/*
+		 * We check to see that pages are not being allocated
+		 * elsewhere at any significant rate implying any
+		 * degree of memory pressure (eg during file reads)
+		 */
+		if (ns->last_free) {
+			if (ns->current_free + SWAP_CLUSTER_MAX <
+			    ns->last_free) {
+				ns->last_free = ns->current_free;
+				node_clear(node,
+					sp_stat.prefetch_nodes);
+				continue;
+			}
+		} else
+			ns->last_free = ns->current_free;
+
+		/* We shouldn't prefetch when we are doing writeback */
+		if (node_page_state(node, NR_WRITEBACK))
+			node_clear(node, sp_stat.prefetch_nodes);
+	}
+
+	/* Nothing suitable, put kprefetchd back to sleep */
+	if (nodes_empty(sp_stat.prefetch_nodes))
+		return TRICKLE_FAILED;
+
+	/* Survived all that? Hooray we can prefetch! */
+	ret = TRICKLE_SUCCESS;
+out:
+	return ret;
+}
+
+/*
+ * trickle_swap is the main function that initiates the swap prefetching. It
+ * first checks to see if the busy flag is set, and does not prefetch if it
+ * is, as the flag implied we are low on memory or swapping in currently.
+ * Otherwise it runs until prefetch_suitable fails which occurs when the
+ * vm is busy, we prefetch to the watermark, the list is empty or we have
+ * iterated over all entries once.
+ */
+static enum trickle_return trickle_swap(void)
+{
+	enum trickle_return suitable, ret = TRICKLE_DELAY;
+	struct swapped_entry *pos, *n;
+	unsigned long flags;
+
+	if (!prefetch_enabled())
+		return ret;
+
+	examine_free_limits();
+	suitable = prefetch_suitable();
+	if (suitable != TRICKLE_SUCCESS)
+		return suitable;
+	if (list_empty(&swapped.list)) {
+		kprefetchd_awake = 0;
+		return TRICKLE_FAILED;
+	}
+
+	spin_lock_irqsave(&swapped.lock, flags);
+	list_for_each_entry_safe_reverse(pos, n, &swapped.list, swapped_list) {
+		swp_entry_t swp_entry;
+		int node;
+
+		spin_unlock_irqrestore(&swapped.lock, flags);
+		cond_resched();
+		suitable = prefetch_suitable();
+		if (suitable != TRICKLE_SUCCESS) {
+			ret = suitable;
+			goto out_unlocked;
+		}
+
+		spin_lock_irqsave(&swapped.lock, flags);
+		if (unlikely(!pos))
+			continue;
+		node = get_swap_entry_node(pos);
+		if (!node_isset(node, sp_stat.prefetch_nodes)) {
+			/*
+			 * We found an entry that belongs to a node that is
+			 * not suitable for prefetching so skip it.
+			 */
+			continue;
+		}
+		swp_entry = pos->swp_entry;
+		spin_unlock_irqrestore(&swapped.lock, flags);
+
+		if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
+			goto out_unlocked;
+		spin_lock_irqsave(&swapped.lock, flags);
+	}
+	spin_unlock_irqrestore(&swapped.lock, flags);
+
+out_unlocked:
+	if (sp_stat.prefetched_pages) {
+		lru_add_drain();
+		sp_stat.prefetched_pages = 0;
+	}
+	return ret;
+}
+
+static int kprefetchd(void *__unused)
+{
+	struct sched_param param = { .sched_priority = 0 };
+
+	sched_setscheduler(current, SCHED_BATCH, &param);
+	set_user_nice(current, 19);
+	/* Set ioprio to lowest if supported by i/o scheduler */
+	sys_ioprio_set(IOPRIO_WHO_PROCESS, IOPRIO_BE_NR - 1, IOPRIO_CLASS_BE);
+
+	while (!kthread_should_stop()) {
+		try_to_freeze();
+
+		if (!kprefetchd_awake) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule();
+			kprefetchd_awake = 1;
+		}
+
+		if (trickle_swap() == TRICKLE_FAILED)
+			schedule_timeout_interruptible(PREFETCH_SLEEP);
+		else
+			schedule_timeout_interruptible(PREFETCH_DELAY);
+		clear_last_prefetch_free();
+	}
+	return 0;
+}
+
+/*
+ * Create kmem cache for swapped entries
+ */
+void __init prepare_swap_prefetch(void)
+{
+	struct zone *zone;
+
+	swapped.cache = kmem_cache_create("swapped_entry",
+		sizeof(struct swapped_entry), 0, SLAB_PANIC, NULL);
+
+	/*
+	 * We set the limit to more entries than the physical ram.
+	 * We remove entries lazily so we need some headroom.
+	 */
+	swapped.maxcount = nr_free_pagecache_pages() * 2;
+
+	for_each_zone(zone) {
+		struct node_stats *ns;
+		int idx;
+
+		if (!populated_zone(zone))
+			continue;
+
+		ns = &sp_stat.node[zone_to_nid(zone)];
+		idx = zone_idx(zone);
+		ns->pointfree[idx] = &ns->highfree[idx];
+	}
+}
+
+static int __init kprefetchd_init(void)
+{
+	kprefetchd_task = kthread_run(kprefetchd, NULL, "kprefetchd");
+
+	return 0;
+}
+
+static void __exit kprefetchd_exit(void)
+{
+	kthread_stop(kprefetchd_task);
+}
+
+module_init(kprefetchd_init);
+module_exit(kprefetchd_exit);
diff -puN mm/swap_state.c~mm-implement-swap-prefetching mm/swap_state.c
--- a/mm/swap_state.c~mm-implement-swap-prefetching
+++ a/mm/swap_state.c
@@ -10,6 +10,7 @@
 #include <linux/mm.h>
 #include <linux/kernel_stat.h>
 #include <linux/swap.h>
+#include <linux/swap-prefetch.h>
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/buffer_head.h>
@@ -95,7 +96,7 @@ static int __add_to_swap_cache(struct pa
 	return error;
 }
 
-static int add_to_swap_cache(struct page *page, swp_entry_t entry)
+int add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
 	int error;
 
@@ -151,6 +152,9 @@ int add_to_swap(struct page * page, gfp_
 	swp_entry_t entry;
 	int err;
 
+	/* Swap prefetching is delayed if we're swapping pages */
+	delay_swap_prefetch();
+
 	BUG_ON(!PageLocked(page));
 
 	for (;;) {
@@ -323,6 +327,9 @@ struct page *read_swap_cache_async(swp_e
 	struct page *found_page, *new_page = NULL;
 	int err;
 
+	/* Swap prefetching is delayed if we're already reading from swap */
+	delay_swap_prefetch();
+
 	do {
 		/*
 		 * First check the swap cache.  Since this is normally
diff -puN mm/vmscan.c~mm-implement-swap-prefetching mm/vmscan.c
--- a/mm/vmscan.c~mm-implement-swap-prefetching
+++ a/mm/vmscan.c
@@ -16,6 +16,7 @@
 #include <linux/slab.h>
 #include <linux/kernel_stat.h>
 #include <linux/swap.h>
+#include <linux/swap-prefetch.h>
 #include <linux/pagemap.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
@@ -1231,6 +1232,7 @@ unsigned long try_to_free_pages(struct z
 		.order = order,
 	};
 
+	delay_swap_prefetch();
 	count_vm_event(ALLOCSTALL);
 
 	for (i = 0; zones[i] != NULL; i++) {
@@ -1677,6 +1679,8 @@ unsigned long shrink_all_memory(unsigned
 		.swappiness = vm_swappiness,
 	};
 
+	delay_swap_prefetch();
+
 	current->reclaim_state = &reclaim_state;
 
 	lru_pages = count_lru_pages();
_

Patches currently in -mm which might be from kernel@xxxxxxxxxxx are

mm-implement-swap-prefetching.patch

-
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html