[PATCH] mm, swap: VMA based swap readahead

"Huang, Ying" <ying.huang@xxxxxxxxx> · Tue, 14 Mar 2017 17:25:29 +0800

From: Huang Ying <ying.huang@xxxxxxxxx>

The swap readahead is a important mechanism to reduce the swap in
latency.  But the original swap readahead algorithm has some issues.

a) The original swap readahead algorithm does readahead based on the
   consecutive blocks in swap device.  But the consecutive blocks in
   swap device just reflect the order of page reclaiming, but don't
   necessarily reflect the access sequence in RAM.  Although for some
   workloads (such as whole system pure sequential workload), it
   works, but in general, it doesn't work well for more complex access
   pattern, such as combined workloads (sequential and random workload
   runs together) or changing accessing patterns in workload.

b) The original swap readahead algorithm just turns random read into
   sequential read, but doesn’t parallel CPU and disk operations.
   This can reduce the average latency and the lowest latency, but
   doesn’t help much for high percentile latency (like 90%, 95%,
   99%).

In this patch, a VMA based swap readahead algorithm is implemented.
When the page fault occurs, the pages along the access direction will
be checked and the swapped out pages will be readahead if they fall
inside the readahead window.  There is one readahead window state for
each VMA, to reflect the different access patterns for different VMAs.
The readahead window is scaled based on whether the accessed page is
consecutive in virtual address with the previous accessed page.  If
so, the readahead window will be extended, otherwise, the readahead
window will be shrunk rapidly.

The test and comparison result is as below.

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk

Pure sequential access pattern workload
=======================================

Test case
---------

vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory write until 300
seconds, for each page accessed, 24us is delayed to emulate CPU
operations on the page.  The first round writing will trigger swap
out, the following rounds will trigger sequential swap in and out.

Result
------

Base kernel:

samples: 100952, average: 40.08us, min: 0us, max: 16.2ms
50th: 1us, 60th: 1us, 70th: 1us, 80th: 3us, 90th: 158us, 95th: 181us, 99th: 254us

Optimized kernel:

samples: 118509, average: 12.04us, min: 0us, max: 17.42ms
50th: 1us, 60th: 2us, 70th: 2us, 80th: 2us, 90th: 8us, 95th: 11us, 99th: 42us

The VMA based swap readahead has much lower latency.  Mainly because
VMA based swap readahead algorithm parallels CPU operations and disk
IO, while the original swap readahead algorithm doesn’t do that.

Pure random access pattern workload
===================================

Test Case
---------

vm-scalability, random swap test case, 4 processes to eat 50G virtual
memory space, repeat the memory random write until 300 seconds, for
each page accessed, 24us is delayed to emulate CPU operations on the
page.  The first round writing will trigger swap out, the following
rounds will trigger random swap in.

Result
------

Base kernel:

samples: 70757, average: 70.71us, min: 0us, max: 15.19ms
50th: 78us, 60th: 83us, 70th: 91us, 80th: 93us, 90th: 97us, 95th: 103us, 99th: 120us

Optimized kernel:

samples: 70842, average: 72.55us, min: 0us, max: 15.47ms
50th: 79us, 60th: 83us, 70th: 92us, 80th: 94us, 90th: 97us, 95th: 104us, 99th: 122us

Almost exact same for all results.  Both are good at avoiding readahead.

Combined access pattern workload
================================

Test Case
---------

vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory write until 300
seconds, for each page accessed, 24us is delayed to emulate CPU
operations on the page.  The first round writing will trigger swap
out, the following round will trigger sequential swap in and out.

As background noise, vm-scalability, random swap test case, 8
processes to eat 30G virtual memory space, repeat the random memory
write until 300 seconds, for each page accessed, 24us is delayed to
emulate CPU operations on the page.  It will trigger random swap in in
the background.

All in all, after reaching relative stable state, there are sequential
and random swap in.

Result (sequential workload)
----------------------------

Base kernel:

samples: 228658, average: 68.52us, min: 0us, max: 23.67ms
50th: 2us, 60th: 42us, 70th: 74us, 80th: 106us, 90th: 150us, 95th: 181us, 99th: 263us

Optimized kernel:

samples: 271851, average: 26.8us, min: 0us, max: 21.24ms
50th: 1us, 60th: 1us, 70th: 2us, 80th: 3us, 90th: 58us, 95th: 115us, 99th: 225us

The original swap readahead algorithm is confused by the background
random access workload, so readahead is not enough for the sequential
access workload.  The VMA-base readahead algorithm works much better.

Signed-off-by: Huang Ying <ying.huang@xxxxxxxxx>
---
 include/linux/mm_types.h |   1 +
 include/linux/swap.h     |  51 +++++++++
 mm/memory.c              |  15 ++-
 mm/swap_state.c          | 287 ++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 345 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f60f45fe226f..cee526f8b08d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,6 +335,7 @@ struct vm_area_struct {
 	struct file * vm_file;		/* File we map to (can be NULL). */
 	void * vm_private_data;		/* was vm_pte (shared mem) */
 
+	atomic_long_t swap_readahead_info;
 #ifndef CONFIG_MMU
 	struct vm_region *vm_region;	/* NOMMU mapping region */
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 486494e6b2fc..89745fb0002b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,6 +250,26 @@ struct swap_info_struct {
 	struct swap_cluster_list discard_clusters; /* discard clusters list */
 };
 
+#ifdef CONFIG_64BIT
+#define SWAP_RA_ORDER_CEILING	9
+#else
+/* Avoid stack overflow, because we need to save part of page table */
+#define SWAP_RA_ORDER_CEILING	2
+/* 2 * window - 1 will be checked for readahead */
+#define SWAP_RA_PTE_CACHE_SIZE	((1 << (1 + SWAP_RA_ORDER_CEILING)) - 1)
+#endif
+
+struct vma_swap_readahead {
+	int direction;
+	int win;
+	int nr_pte;
+#ifdef CONFIG_64BIT
+	pte_t *ptes;
+#else
+	pte_t ptes[SWAP_RA_PTE_CACHE_SIZE];
+#endif
+};
+
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 bool workingset_refault(void *shadow);
@@ -348,6 +368,7 @@ int generic_swapfile_activate(struct swap_info_struct *, struct file *,
 #define SWAP_ADDRESS_SPACE_SHIFT	14
 #define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
 extern struct address_space *swapper_spaces[];
+extern bool swap_vma_readahead;
 #define swap_address_space(entry)			    \
 	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
 		>> SWAP_ADDRESS_SPACE_SHIFT])
@@ -369,6 +390,18 @@ extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
 extern struct page *swapin_readahead(swp_entry_t, gfp_t,
 			struct vm_area_struct *vma, unsigned long addr);
 
+extern void swap_readahead_detect(struct vm_fault *vmf,
+				  struct vma_swap_readahead *swap_ra);
+extern struct page *do_swap_page_readahead(struct vm_fault *vmf,
+					   struct vma_swap_readahead *swap_ra,
+					   swp_entry_t fentry,
+					   struct page *fpage);
+
+static inline bool swap_use_vma_readahead(void)
+{
+	return READ_ONCE(swap_vma_readahead);
+}
+
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
@@ -466,6 +499,24 @@ static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 	return NULL;
 }
 
+static inline bool swap_use_vma_readahead(void)
+{
+	return false;
+}
+
+static inline void swap_readahead_detect(struct vm_fault *vmf,
+					 struct vma_swap_readahead *swap_ra)
+{
+}
+
+static inline struct page *do_swap_page_readahead(struct vm_fault *vmf,
+						  struct vma_swap_readahead *swap_ra,
+						  swp_entry_t fentry,
+						  struct page *fpage)
+{
+	return NULL;
+}
+
 static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
 {
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index fe57ef2996b5..e7c22e17e1fa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2562,12 +2562,16 @@ int do_swap_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page, *swapcache;
 	struct mem_cgroup *memcg;
+	struct vma_swap_readahead swap_ra;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
 	int exclusive = 0;
 	int ret = 0;
+	bool vma_readahead = swap_use_vma_readahead();
 
+	if (vma_readahead)
+		swap_readahead_detect(vmf, &swap_ra);
 	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
 		goto out;
 
@@ -2587,8 +2591,12 @@ int do_swap_page(struct vm_fault *vmf)
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry);
 	if (!page) {
-		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vma,
-					vmf->address);
+		if (vma_readahead)
+			page = do_swap_page_readahead(vmf, &swap_ra,
+						      entry, NULL);
+		else
+			page = swapin_readahead(entry,
+				GFP_HIGHUSER_MOVABLE, vma, vmf->address);
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte
@@ -2615,7 +2623,8 @@ int do_swap_page(struct vm_fault *vmf)
 		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 		swapcache = page;
 		goto out_release;
-	}
+	} else if (vma_readahead)
+		do_swap_page_readahead(vmf, &swap_ra, entry, page);
 
 	swapcache = page;
 	locked = swap_lock_page_or_retry(page, vma->vm_mm, vmf->flags);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 473b71e052a8..2e74b2d0228c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -36,6 +36,7 @@ static const struct address_space_operations swap_aops = {
 
 struct address_space *swapper_spaces[MAX_SWAPFILES];
 static unsigned int nr_swapper_spaces[MAX_SWAPFILES];
+bool swap_vma_readahead = true;
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
 
@@ -295,13 +296,15 @@ struct page * lookup_swap_cache(swp_entry_t entry)
 
 	page = find_get_page(swap_address_space(entry), swp_offset(entry));
 
-	if (page) {
-		INC_CACHE_INFO(find_success);
-		if (TestClearPageReadahead(page))
-			atomic_inc(&swapin_readahead_hits);
-	}
+	if (!swap_vma_readahead) {
+		if (page) {
+			INC_CACHE_INFO(find_success);
+			if (TestClearPageReadahead(page))
+				atomic_inc(&swapin_readahead_hits);
+		}
 
-	INC_CACHE_INFO(find_total);
+		INC_CACHE_INFO(find_total);
+	}
 	return page;
 }
 
@@ -561,3 +564,275 @@ void exit_swap_address_space(unsigned int type)
 	synchronize_rcu();
 	kvfree(spaces);
 }
+
+#ifdef CONFIG_64BIT
+#define SWAP_RA_ORDER_DEFAULT	3
+#else
+#define SWAP_RA_ORDER_DEFAULT	2
+#endif
+
+static int swap_ra_order_max = SWAP_RA_ORDER_DEFAULT;
+
+#define SWAP_RA_ADDR(v)		((v) & PAGE_MASK)
+#define SWAP_RA_HITS(v)		((v) & (PAGE_SIZE - 1))
+
+#define SWAP_RA_VAL(addr, hits)	((addr) | (hits))
+
+static inline int swap_ra_hits_max(int max_order)
+{
+	return 2 * max_order;
+}
+
+static inline int swap_ra_hits_to_order(int hits)
+{
+	return hits / 2;
+}
+
+static inline int swap_ra_detect_window(int max_order)
+{
+	return 1 << (max_order + PAGE_SHIFT);
+}
+
+static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
+				     unsigned long faddr,
+				     unsigned long addr,
+				     unsigned long *start,
+				     unsigned long *end)
+{
+	*start = max3(min(addr, faddr) + PAGE_SIZE,
+		      vma->vm_start, faddr & PMD_MASK) >> PAGE_SHIFT;
+	*end = min3(max(addr, faddr),
+		    vma->vm_end, (faddr & PMD_MASK) + PMD_SIZE) >> PAGE_SHIFT;
+}
+
+void swap_readahead_detect(struct vm_fault *vmf,
+			   struct vma_swap_readahead *swap_ra)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long swap_ra_info;
+	unsigned long addr, faddr;
+	unsigned long start, end, pfn;
+	pte_t *pte;
+	int max_order, hits, hit;
+#ifndef CONFIG_64BIT
+	pte_t *tpte;
+#endif
+
+	swap_ra_info = atomic_long_read(&vma->swap_readahead_info);
+	max_order = READ_ONCE(swap_ra_order_max);
+
+	addr = SWAP_RA_ADDR(swap_ra_info);
+	hits = SWAP_RA_HITS(swap_ra_info);
+	faddr = vmf->address & PAGE_MASK;
+	swap_ra->direction = faddr >= addr ? 1 : -1;
+
+	if (faddr == addr)
+		hit = 0;
+	else if (addr < vma->vm_start || addr >= vma->vm_end)
+		hit = -1;
+	else if (abs((long)(faddr - addr)) < swap_ra_detect_window(max_order)) {
+		/*
+		 * If all virtual pages from the last to the current
+		 * fault address are presented, it is sequential
+		 * access.  But if cross PMD, the readahead window
+		 * size will be kept at best.
+		 */
+		hit = (addr & PMD_MASK) == (faddr & PMD_MASK);
+		swap_ra_clamp_pfn(vma, faddr, addr, &start, &end);
+		pte = vmf->pte - ((faddr >> PAGE_SHIFT) - start);
+		for (pfn = start; pfn != end; pfn++, pte++)
+			if (!pte_present(*pte)) {
+				hit = -1;
+				break;
+			}
+	} else
+		hit = -1;
+
+	/*
+	 * If failed in the sequential access detection, shrink the
+	 * readahead window rapidly.  Otherwise keep or enlarge
+	 * readahead window.
+	 */
+	if (hit < 0)
+		hits /= 2;
+	else
+		hits = min(hits + hit, swap_ra_hits_max(max_order));
+
+	/* Update readahead information */
+	atomic_long_set(&vma->swap_readahead_info, SWAP_RA_VAL(faddr, hits));
+	swap_ra->win = 1 << swap_ra_hits_to_order(hits);
+
+	/* No readahead */
+	if (swap_ra->win == 1) {
+		swap_ra->nr_pte = 0;
+		return;
+	}
+
+	addr = faddr + (swap_ra->win << (1 + PAGE_SHIFT)) * swap_ra->direction;
+	swap_ra_clamp_pfn(vma, faddr, addr, &start, &end);
+
+	swap_ra->nr_pte = end - start;
+	pte = vmf->pte - ((faddr >> PAGE_SHIFT) - start);
+#ifdef CONFIG_64BIT
+	swap_ra->ptes = pte;
+#else
+	tpte = swap_ra->ptes;
+	for (pfn = start; pfn != end; pfn++)
+		*tpte++ = *pte++;
+#endif
+}
+
+struct page *do_swap_page_readahead(struct vm_fault *vmf,
+				    struct vma_swap_readahead *swap_ra,
+				    swp_entry_t fentry,
+				    struct page *fpage)
+{
+	struct blk_plug plug;
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page;
+	unsigned long addr;
+	pte_t *pte, pentry;
+	gfp_t gfp_mask;
+	swp_entry_t entry;
+	int i, alloc = 0, count;
+	bool page_allocated;
+
+	addr = vmf->address & PAGE_MASK;
+	blk_start_plug(&plug);
+	if (!fpage) {
+		fpage = __read_swap_cache_async(fentry, GFP_HIGHUSER_MOVABLE,
+						vma, addr, &page_allocated);
+		if (!fpage) {
+			blk_finish_plug(&plug);
+			return NULL;
+		}
+		if (page_allocated) {
+			alloc++;
+			swap_readpage(fpage);
+		}
+	}
+	/* fault page has been checked */
+	count = 1;
+	addr += PAGE_SIZE * swap_ra->direction;
+	pte = swap_ra->ptes;
+	if (swap_ra->direction < 0)
+		pte += swap_ra->nr_pte - 1;
+	/* check 2 * window size, but readahead window size at most */
+	for (i = 0; i < swap_ra->nr_pte && alloc < swap_ra->win;
+	     i++, count++, pte += swap_ra->direction,
+		     addr += PAGE_SIZE * swap_ra->direction) {
+		/* Page present for window size, skip readahead */
+		if (!alloc && count >= swap_ra->win)
+			break;
+		pentry = *pte;
+		if (pte_none(pentry))
+			break;
+		if (pte_present(pentry))
+			continue;
+		entry = pte_to_swp_entry(pentry);
+		if (unlikely(non_swap_entry(entry)))
+			break;
+		/* Avoid page allocation latency from readahead */
+		gfp_mask = __GFP_KSWAPD_RECLAIM | __GFP_HIGHMEM |
+			__GFP_MOVABLE | __GFP_HARDWALL | __GFP_NOWARN;
+		page = __read_swap_cache_async(entry, gfp_mask, vma, addr,
+					       &page_allocated);
+		/*
+		 * Memory allocation failure, avoid to put too much overhead
+		 * on memory allocator because of readahead
+		 */
+		if (!page)
+			break;
+		if (page_allocated) {
+			alloc++;
+			swap_readpage(page);
+		}
+		put_page(page);
+	}
+	blk_finish_plug(&plug);
+	/* Push any new pages onto the LRU now */
+	if (alloc)
+		lru_add_drain();
+
+	return fpage;
+}
+
+#ifdef CONFIG_SYSFS
+static ssize_t vma_ra_enabled_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%s\n", swap_vma_readahead ? "true" : "false");
+}
+static ssize_t vma_ra_enabled_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
+		swap_vma_readahead = true;
+	else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
+		swap_vma_readahead = false;
+	else
+		return -EINVAL;
+
+	return count;
+}
+static struct kobj_attribute vma_ra_enabled_attr =
+	__ATTR(vma_ra_enabled, 0644, vma_ra_enabled_show,
+	       vma_ra_enabled_store);
+
+static ssize_t vma_ra_max_order_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", swap_ra_order_max);
+}
+static ssize_t vma_ra_max_order_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	int err, v;
+
+	err = kstrtoint(buf, 10, &v);
+	if (err || v > UINT_MAX)
+		return -EINVAL;
+
+	swap_ra_order_max = max(min(v, SWAP_RA_ORDER_CEILING), 0);
+
+	return count;
+}
+static struct kobj_attribute vma_ra_max_order_attr =
+	__ATTR(vma_ra_max_order, 0644, vma_ra_max_order_show,
+	       vma_ra_max_order_store);
+
+static struct attribute *swap_attrs[] = {
+	&vma_ra_enabled_attr.attr,
+	&vma_ra_max_order_attr.attr,
+	NULL,
+};
+
+static struct attribute_group swap_attr_group = {
+	.attrs = swap_attrs,
+};
+
+static int __init swap_init_sysfs(void)
+{
+	int err;
+	struct kobject *swap_kobj;
+
+	swap_kobj = kobject_create_and_add("swap", mm_kobj);
+	if (unlikely(!swap_kobj)) {
+		pr_err("failed to create swap kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(swap_kobj, &swap_attr_group);
+	if (err) {
+		pr_err("failed to register swap group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(swap_kobj);
+	return err;
+}
+subsys_initcall(swap_init_sysfs);
+#endif
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>