Re: [LSF/MM TOPIC] plans for future swap changes

Hi,

On Wed, Dec 28, 2016 at 03:57:32PM +0100, Michal Hocko wrote:
> This is something I would be interested to discuss even though I am not
> working on it directly. Sorry if I hijacked the topic from those who
> planned to post it.
> 
> It seems that the time to reconsider our approach to swap storage has
> come already and there are multiple areas to discuss. I would be
> interested in at least the following:
> 1) anon/file balancing. Johannes has posted some work already and I am
>    really interested in the future plans for it.

They needed some surgery to work on top of the node-LRU rewrite. I've
restored performance on the benchmarks I was using and will post them
after some more cleaning up and writing changelogs for the new pieces.

> 2) swap thrashing detection is something that we have been lacking for a
>    long time and it would be great if we could do something to help
>    situations when the machine is effectively out of memory but still
>    hopelessly trying to swap a few pages in and out while remaining
>    basically unusable. I hope that 1) will give us some basis but I am
>    not sure how much we will need on top.

Yes, this keeps biting us quite frequently. Not with swap so much as
page cache, but it's the same problem: while we know all the thrashing
*events*, we don't know how much they truly cost us. I've started
drafting a thrashing quantification patch based on feedback from the
Kernel Summit, attaching it below. It's unbelievably crude and needs
more thought on sampling/decaying, as well as on filtering out swapins
that happen after pressure has otherwise subsided. But it does give me
a reasonable-looking thrashing ratio under memory pressure.

> 3) optimizations for the swap out paths - Tim Chen and other guys from
>    Intel are already working on this. I didn't get time to review this
>    closely - mostly because I am not closely familiar with the swapout
>    code and it takes quite some time to get into all subtle details.
>    I am mainly interested in what the plans are in this area and how
>    they should be coordinated with other swap-related changes.
> 4) Do we want native THP swap in/out support?

Shaohua had some opinions on this, he might be interested in joining
this discussion. CCing him.

---

From 8e3e24a35932af85392c238881b2c3a647f09024 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Sat, 3 Dec 2016 10:30:58 -0500
Subject: [PATCH] mm: vmstat: count task actual runtime & task memory delay
 time

This gives us an idea of how much of a task's overall time was spent
reclaiming, waiting for other reclaimers, or waiting for refaults,
versus the time actually spent executing.

From here:

- Swapping in on an idle system. That's not thrashing.

- p50, p95, p99 latencies of the task overall?

- How much was the memory cgroup delayed?

- How much was the system delayed?

Per-task accounting alone is not too interesting. It needs to be
per-group or system-wide. Think of a shell script, where the end goal
the user is waiting for involves many forks and short-lived tasks. If
each task gets stalled for just a little bit, it adds up to something
severe. It's jobs and the system as a whole that matter.

Accumulate a task's actual CPU time R in the task's memcg. Accumulate
in the same memcg the thrash time T that the task spends running and
sleeping in page reclaim and in lock_page() on workingset pages.
Signal memory pressure and OOM based on T / (R + T); e.g. 20ms of
thrashing against 80ms of runtime gives a 20% ratio.

schedule:
  if !PF_MEMDELAY:
    flush runtime since scheduled into prev->memcg->R

try_to_free_pages:
  memdelay_start()
  ...
  memdelay_end()

lock_page(refaulting_page):
  memdelay_start()
  ...
  memdelay_end()

memdelay_start:
  flush runtime since scheduled into current->memcg->R
  set PF_MEMDELAY

memdelay_end:
  flush runtime since memdelay_start() into current->memcg->T
  clear PF_MEMDELAY

memcg->R and memcg->T need to have a sane sampling period (a single IO
wait in a short-lived program can easily be 99% of R+T) and then decay
beyond that.

For a global view, sum R and T from all memcgs in the system before
taking the ratio of the sums.

If one group's ratio exceeds the threshold, walk up the tree and
recursively calculate the ratios to decide where to signal memory
pressure or OOM.
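
For illustration, a minimal userspace-style sketch of that hierarchical
aggregation and walk-up (the group structure, the pressure_scope()
helper and the 50% threshold are assumptions of this note, not
interfaces defined by the patch below):

#include <stdint.h>
#include <stdio.h>

struct group {
	const char *name;
	uint64_t R;			/* accumulated runtime, ns */
	uint64_t T;			/* accumulated thrash time, ns */
	struct group *parent;
	struct group *children[4];
	int nr_children;
};

/* Sum R and T over a group's whole subtree before taking the ratio. */
static void sum_subtree(struct group *g, uint64_t *R, uint64_t *T)
{
	int i;

	*R += g->R;
	*T += g->T;
	for (i = 0; i < g->nr_children; i++)
		sum_subtree(g->children[i], R, T);
}

static unsigned int thrash_percentage(struct group *g)
{
	uint64_t R = 0, T = 0;

	sum_subtree(g, &R, &T);
	if (R + T == 0)
		return 0;
	return (unsigned int)(T * 100 / (R + T));
}

/*
 * Walk up from a thrashing group; the highest ancestor whose subtree
 * ratio exceeds the threshold is where pressure (or OOM) would be
 * signalled.
 */
static struct group *pressure_scope(struct group *leaf, unsigned int threshold)
{
	struct group *scope = NULL;
	struct group *g;

	for (g = leaf; g; g = g->parent)
		if (thrash_percentage(g) > threshold)
			scope = g;
	return scope;
}

int main(void)
{
	struct group root = { .name = "root", .R = 10000 };
	struct group job = { .name = "job", .parent = &root,
			     .R = 400, .T = 600 };
	struct group *scope;

	root.children[0] = &job;
	root.nr_children = 1;

	scope = pressure_scope(&job, 50);
	printf("job: %u%%, root: %u%%, signal at: %s\n",
	       thrash_percentage(&job), thrash_percentage(&root),
	       scope ? scope->name : "none");
	return 0;
}

Note that the draft below does none of this yet: it only tracks a
single global R and T pair and exposes the resulting ratio through a
debugfs file.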
---
 include/linux/sched.h  |   4 ++
 include/linux/vmstat.h |   3 ++
 kernel/sched/fair.c    |   3 ++
 mm/filemap.c           |  70 +++++++++++++++++++++-----
 mm/page_alloc.c        |   4 ++
 mm/swap.c              |   4 --
 mm/vmscan.c            |   2 +-
 mm/vmstat.c            | 132 +++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 206 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b0ec92..da72db0467e9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1652,6 +1652,9 @@ struct task_struct {
 /* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
 	unsigned long min_flt, maj_flt;
 
+	u64 memdelay_runtime;		/* Accrued CPU time since last thrashing (or fork) */
+	u64 memdelay_begin;		/* Time when current thrashing period began */
+
 	struct task_cputime cputime_expires;
 	struct list_head cpu_timers[3];
 
@@ -2276,6 +2279,7 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
+#define PF_MEMDELAY	0x01000000	/* Delayed by lack of memory */
 #define PF_NO_SETAFFINITY 0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 613771909b6e..d4fd4d777c20 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -351,4 +351,7 @@ static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
 
 extern const char * const vmstat_text[];
 
+void memdelay_start(void);
+void memdelay_end(void);
+
 #endif /* _LINUX_VMSTAT_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c242944f5cbd..69d27f728fa5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -825,6 +825,9 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
+
+		if (!(curtask->flags & PF_MEMDELAY))
+			curtask->memdelay_runtime += delta_exec;
 	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
diff --git a/mm/filemap.c b/mm/filemap.c
index 2dd472c0f2ca..56152cab1ef5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -742,35 +742,66 @@ EXPORT_SYMBOL(page_waitqueue);
 
 void wait_on_page_bit(struct page *page, int bit_nr)
 {
+	bool refault = bit_nr == PG_locked && PageWorkingset(page);
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
-	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, bit_wait_io,
-							TASK_UNINTERRUPTIBLE);
+	if (!test_bit(bit_nr, &page->flags))
+		return;
+
+	if (refault)
+		memdelay_start();
+
+	__wait_on_bit(page_waitqueue(page), &wait, bit_wait_io,
+		      TASK_UNINTERRUPTIBLE);
+
+	if (refault)
+		memdelay_end();
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
 int wait_on_page_bit_killable(struct page *page, int bit_nr)
 {
+	bool refault = bit_nr == PG_locked && PageWorkingset(page);
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+	int ret;
 
 	if (!test_bit(bit_nr, &page->flags))
 		return 0;
 
-	return __wait_on_bit(page_waitqueue(page), &wait,
-			     bit_wait_io, TASK_KILLABLE);
+	if (refault)
+		memdelay_start();
+
+	ret = __wait_on_bit(page_waitqueue(page), &wait,
+			    bit_wait_io, TASK_KILLABLE);
+
+	if (refault)
+		memdelay_end();
+
+	return ret;
 }
 
 int wait_on_page_bit_killable_timeout(struct page *page,
 				       int bit_nr, unsigned long timeout)
 {
+	bool refault = bit_nr == PG_locked && PageWorkingset(page);
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+	int ret;
 
 	wait.key.timeout = jiffies + timeout;
+
 	if (!test_bit(bit_nr, &page->flags))
 		return 0;
-	return __wait_on_bit(page_waitqueue(page), &wait,
-			     bit_wait_io_timeout, TASK_KILLABLE);
+
+	if (refault)
+		memdelay_start();
+
+	ret = __wait_on_bit(page_waitqueue(page), &wait,
+			    bit_wait_io_timeout, TASK_KILLABLE);
+
+	if (refault)
+		memdelay_end();
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(wait_on_page_bit_killable_timeout);
 
@@ -871,21 +902,38 @@ EXPORT_SYMBOL_GPL(page_endio);
  */
 void __lock_page(struct page *page)
 {
+	bool refault = PageWorkingset(page);
 	struct page *page_head = compound_head(page);
 	DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page_head), &wait, bit_wait_io,
-							TASK_UNINTERRUPTIBLE);
+	if (refault)
+		memdelay_start();
+
+	__wait_on_bit_lock(page_waitqueue(page_head), &wait,
+			   bit_wait_io, TASK_UNINTERRUPTIBLE);
+
+	if (refault)
+		memdelay_end();
 }
 EXPORT_SYMBOL(__lock_page);
 
 int __lock_page_killable(struct page *page)
 {
+	int ret;
+	bool refault = PageWorkingset(page);
 	struct page *page_head = compound_head(page);
 	DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
 
-	return __wait_on_bit_lock(page_waitqueue(page_head), &wait,
-					bit_wait_io, TASK_KILLABLE);
+	if (refault)
+		memdelay_start();
+
+	ret = __wait_on_bit_lock(page_waitqueue(page_head), &wait,
+				 bit_wait_io, TASK_KILLABLE);
+
+	if (refault)
+		memdelay_end();
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 072d791dce2d..c470b8fe28cf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3121,10 +3121,12 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	if (!order)
 		return NULL;
 
+	memdelay_start();
 	current->flags |= PF_MEMALLOC;
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 									prio);
 	current->flags &= ~PF_MEMALLOC;
+	memdelay_end();
 
 	if (*compact_result <= COMPACT_INACTIVE)
 		return NULL;
@@ -3266,6 +3268,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
+	memdelay_start();
 	current->flags |= PF_MEMALLOC;
 	lockdep_set_current_reclaim_state(gfp_mask);
 	reclaim_state.reclaimed_slab = 0;
@@ -3277,6 +3280,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	current->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
 	current->flags &= ~PF_MEMALLOC;
+	memdelay_end();
 
 	cond_resched();
 
diff --git a/mm/swap.c b/mm/swap.c
index ece018771488..9d59739f34f5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -280,7 +280,6 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		int lru = page_lru_base_type(page);
 
 		del_page_from_lru_list(page, lruvec, lru);
-		SetPageWorkingset(page);
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(page, lruvec, lru);
@@ -858,7 +857,6 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 {
 	unsigned int nr_pages = hpage_nr_pages(page);
 	enum lru_list lru = page_lru(page);
-	bool active = is_active_lru(lru);
 	bool file = is_file_lru(lru);
 	bool new = (bool)arg;
 
@@ -874,8 +872,6 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 		 */
 		if (PageWorkingset(page))
 			lru_note_cost(lruvec, COST_IO, file, nr_pages);
-		else if (active)
-			SetPageWorkingset(page);
 	}
 
 	trace_mm_lru_insertion(page, lru);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8861d134e604..1b112f291fbe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1242,7 +1242,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
 			try_to_free_swap(page);
 		VM_BUG_ON_PAGE(PageActive(page), page);
-		SetPageWorkingset(page);
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
@@ -1960,6 +1959,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		ClearPageActive(page);	/* we are de-activating */
+		SetPageWorkingset(page);
 		list_add(&page->lru, &l_inactive);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8a88a65bfb13..c9c68034ac36 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1965,3 +1965,135 @@ static int __init extfrag_debug_init(void)
 
 module_init(extfrag_debug_init);
 #endif
+
+static DEFINE_SPINLOCK(lock);
+
+static u64 memdelay_runtime;
+static u64 memdelay_waitime;
+
+static u64 memdelay_main_clock;
+static u64 memdelay_idle_clock;
+
+static bool memdelay_oomwatch_armed;
+static u64 memdelay_oomwatch_begin;
+
+static int memdelay_percentage(void)
+{
+	return memdelay_waitime * 100 / max(memdelay_runtime + memdelay_waitime, 1ULL);
+}
+
+static bool memdelay_update_clock(bool begin_idle)
+{
+	u64 now = ktime_get_ns();
+
+	/* Quiescent state 1s after last thrashing event */
+	if (begin_idle) {
+		memdelay_idle_clock = now + 1000000000;
+	} else if (memdelay_idle_clock < now) {
+		memdelay_runtime = 0;
+		memdelay_waitime = 0;
+	}
+
+	/* Decay history gradually */
+	while (memdelay_main_clock < now) {
+		if (!memdelay_runtime && !memdelay_waitime) {
+			memdelay_main_clock = now + 1000000000;
+			break;
+		}
+		memdelay_runtime -= memdelay_runtime / 4;
+		memdelay_waitime -= memdelay_waitime / 4;
+		memdelay_main_clock += 1000000000;
+	}
+
+	/* OOM kill persistent thrashing */
+	/* XXX: idle swap in isn't thrashing */
+	if (memdelay_percentage() > 50) {
+		if (!memdelay_oomwatch_armed) {
+			memdelay_oomwatch_armed = true;
+			memdelay_oomwatch_begin = now;
+		} else if (now > memdelay_oomwatch_begin + 5000000000) {
+			memdelay_oomwatch_begin = now;
+			return true;
+		}
+	} else if (memdelay_oomwatch_armed) {
+		memdelay_oomwatch_armed = false;
+	}
+	return false;
+}
+
+void memdelay_start(void)
+{
+	unsigned long flags;
+	bool oom;
+
+	VM_BUG_ON(current->flags & PF_MEMDELAY);
+
+	/* Advance sched clock to end of runtime, save current->memdelay_runtime */
+	yield();
+	current->flags |= PF_MEMDELAY;
+
+	/* Accumulate successful runtime */
+	spin_lock_irqsave(&lock, flags);
+	oom = memdelay_update_clock(false);
+	memdelay_runtime += current->memdelay_runtime;
+	current->memdelay_runtime = 0;
+	spin_unlock_irqrestore(&lock, flags);
+
+	if (oom)
+		pagefault_out_of_memory();
+
+	current->memdelay_begin = ktime_get_ns();
+}
+
+void memdelay_end(void)
+{
+	unsigned long flags;
+	bool oom;
+
+	VM_BUG_ON(!(current->flags & PF_MEMDELAY));
+
+	/* Advance sched clock to end of thrashing time */
+	yield();
+	current->flags &= ~PF_MEMDELAY;
+
+	/* Accumulate thrashing time */
+	spin_lock_irqsave(&lock, flags);
+	memdelay_waitime += ktime_get_ns() - current->memdelay_begin;
+	oom = memdelay_update_clock(true);
+	spin_unlock_irqrestore(&lock, flags);
+
+	if (oom)
+		pagefault_out_of_memory();
+}
+
+static int memdelay_read(void *data, u64 *val)
+{
+	unsigned long flags;
+	bool oom;
+	spin_lock_irqsave(&lock, flags);
+	oom = memdelay_update_clock(false);
+	*val = memdelay_percentage();
+	spin_unlock_irqrestore(&lock, flags);
+	if (oom)
+		pagefault_out_of_memory();
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(memdelay_state, memdelay_read, NULL, "%llu%%\n");
+
+static int __init memdelay_init(void)
+{
+	debugfs_create_file("memdelay", S_IFREG|S_IRUGO, NULL, NULL, &memdelay_state);
+	/*
+	 * XXX: Hung task watchdog should check percentage of each
+	 * cgroup in the system, including root. If any of them
+	 * persists thrashing at 60+ percent for several seconds, the
+	 * OOM killer needs to be invoked.
+	 *
+	 * Which OOM killer? If a group is thrashing and its reclaim
+	 * happens due to a local limit, call that group's OOM killer.
+	 * If a group is thrashing but the reclaim is from the parent,
+	 * kill within the parent's hierarchy, etc.
+	 */
+	return 0;
+}
+module_init(memdelay_init);
-- 
2.10.2



