Re: Detecting page cache thrashing state

Hi Ruslan,

sorry about the delayed response, I missed the new activity in this
older thread.

On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
> 
> I was able to rebase the patch on top of v4.9.26 (the latest version we
> currently support) and test it a bit.
> The overall idea definitely looks promising, although I have one question
> about usage.
> Will it be able to account for the time processes spend handling major
> page faults (including fs and iowait time) for refaulting pages?

That's the main thing it should measure! :)

The lock_page() and wait_on_page_locked() calls are where iowaits
happen on a cache miss. If those are refaults, they'll be counted.
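
To sketch where that accounting would hook in (illustrative only, not
part of the patches below; memdelay_enter()/memdelay_leave() and
PageWorkingset() are names from this series, and the exact call site
in your rebase may differ):

	/* hypothetical wrapper, for illustration */
	static void wait_on_page_locked_accounted(struct page *page)
	{
		unsigned long mdflags;
		bool refault = PageWorkingset(page);	/* detected refault? */

		if (refault)
			memdelay_enter(&mdflags);	/* task now memory-delayed */

		wait_on_page_locked(page);		/* the iowait on the miss */

		if (refault)
			memdelay_leave(&mdflags);	/* productive again */
	}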

> We have one big application whose code occupies a large amount of page
> cache. When the system reclaims some of it under heavy memory pressure,
> the application starts thrashing constantly. Since its code lives on
> squashfs, it spends all of its CPU time decompressing pages, and the
> memdelay counters don't seem to detect this situation.
> Here are some counters illustrating this:
> 
> 19:02:44        CPU     %user     %nice   %system   %iowait    %steal     %idle
> 19:02:45        all      0.00      0.00    100.00      0.00      0.00      0.00
> 
> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
> 19:02:45     15284.00      0.00    428.00    352.00  19990.00      0.00      0.00  15802.00      0.00
> 
> And since nobody is actively allocating memory anymore, it looks like
> the memdelay counters are no longer being incremented:
> 
> [:~]$ cat /proc/memdelay
> 268035776
> 6.13 5.43 3.58
> 1.90 1.89 1.26

How does it correlate with /proc/vmstat::workingset_activate during
that time? It only counts thrashing time of refaults it can actively
detect.
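
In case it's handy, those detection counters can be sampled directly
(values elided here):

	$ grep workingset /proc/vmstat
	workingset_refault <n>
	workingset_activate <n>
	workingset_nodereclaim <n>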

Btw, how many CPUs does this system have? There is a bug in this
version in how idle time is aggregated across multiple CPUs. The error
compounds with the number of CPUs in the system.
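
To illustrate with made-up numbers: on an 8-CPU machine with one CPU
fully stalled on refaults and the other seven idle, the buggy
aggregation counts the idle CPUs toward the productive baseline:

	reported = stalled / (stalled + productive + idle)
	         = 1 / (1 + 0 + 7) = 12.5%    (idle counted as productive)

	fixed    = stalled / (stalled + productive)
	         = 1 / (1 + 0)     = 100%     (idle excluded, second fix below)

so the reported pressure shrinks roughly linearly with the CPU count.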

I'm attaching 3 bugfixes that go on top of what you have. There might
be some conflicts, but they should be minor variable naming issues.

From 7318c963a582833d4556c51fc2e1658e00c14e3e Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Thu, 5 Oct 2017 12:32:47 -0400
Subject: [PATCH 1/3] mm: memdelay: fix task flags race condition

WARNING: CPU: 35 PID: 2263 at ../include/linux/memcontrol.h:466

This is memcg warning that current->memcg_may_oom is set when it
doesn't expect it to be.

The warning is new with the memdelay patches. They add another task
flag in the same int as memcg_may_oom, but modify it from
try_to_wake_up(), i.e. from a task that isn't current. That isn't safe.

Move the flag to the other int holding task flags, whose modifications
are serialized through the scheduler locks.

Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
 include/linux/sched.h | 2 +-
 kernel/sched/core.c   | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index de15e3c8c43a..d1aa8f4c19ab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -627,6 +627,7 @@ struct task_struct {
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
 	unsigned			sched_remote_wakeup:1;
+	unsigned			sched_memdelay_requeue:1;
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
 
@@ -651,7 +652,6 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned			no_cgroup_migration:1;
 #endif
-	unsigned			memdelay_migrate_enqueue:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bf105c870da6..b4fa806bf153 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -760,10 +760,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);
 
-	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->memdelay_migrate_enqueue);
-	if (!(flags & ENQUEUE_WAKEUP) || p->memdelay_migrate_enqueue) {
+	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->sched_memdelay_requeue);
+	if (!(flags & ENQUEUE_WAKEUP) || p->sched_memdelay_requeue) {
 		memdelay_add_runnable(p);
-		p->memdelay_migrate_enqueue = 0;
+		p->sched_memdelay_requeue = 0;
 	} else {
 		memdelay_wakeup(p);
 	}
@@ -2065,8 +2065,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 		rq = __task_rq_lock(p, &rf);
 		memdelay_del_sleeping(p);
+		p->sched_memdelay_requeue = 1;
 		__task_rq_unlock(rq, &rf);
-		p->memdelay_migrate_enqueue = 1;
 
 		set_task_cpu(p, cpu);
 	}
-- 
2.14.2

From 7157c70aed93990f59942d39d1c0d8948164cfe2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Thu, 5 Oct 2017 12:34:49 -0400
Subject: [PATCH 2/3] mm: memdelay: idle time is not productive time

There is an error in the multi-core logic: memory delay numbers drop
as the number of CPU cores increases. Idle CPUs, even though they
don't host delayed processes, shouldn't contribute to the "not
delayed" bucket, because that bucket is the baseline for productivity,
and idle CPUs aren't productive.

Do not consider idle CPU time in the productivity baseline.

Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
 include/linux/memdelay.h | 3 ++-
 mm/memdelay.c            | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/memdelay.h b/include/linux/memdelay.h
index d85d01692610..c49f65338c1f 100644
--- a/include/linux/memdelay.h
+++ b/include/linux/memdelay.h
@@ -24,7 +24,8 @@ enum memdelay_task_state {
  * productivity states of all tasks inside the domain.
  */
 enum memdelay_domain_state {
-	MDS_NONE,		/* No delayed tasks */
+	MDS_IDLE,		/* No tasks */
+	MDS_NONE,		/* Working tasks */
 	MDS_PART,		/* Delayed tasks, working tasks */
 	MDS_FULL,		/* Delayed tasks, no working tasks */
 	NR_MEMDELAY_DOMAIN_STATES,
diff --git a/mm/memdelay.c b/mm/memdelay.c
index c7c32dbb67ac..ea5ede79f044 100644
--- a/mm/memdelay.c
+++ b/mm/memdelay.c
@@ -118,8 +118,10 @@ static void domain_cpu_update(struct memdelay_domain *md, int cpu,
 	else if (mdc->tasks[MTS_DELAYED])
 		state = (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT]) ?
 			MDS_PART : MDS_FULL;
-	else
+	else if (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT])
 		state = MDS_NONE;
+	else
+		state = MDS_IDLE;
 
 	if (mdc->state == state)
 		return;
-- 
2.14.2

From ea663e42b24871d370b6ddbfbf47c1775a2f9f09 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <jweiner@xxxxxx>
Date: Thu, 28 Sep 2017 10:36:39 -0700
Subject: [PATCH 3/3] mm: memdelay: drop IO as productive time

Counting IO as productive time distorts the sense of pressure with
workloads that are naturally IO-bound. Memory pressure can cause IO,
and thus cause "productive" IO to slow down, yet we don't properly
attribute this slowdown to a shortage of memory.

Disregard IO time altogether, and use CPU time alone as the baseline.

Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
 mm/memdelay.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memdelay.c b/mm/memdelay.c
index ea5ede79f044..fbce1d4ba142 100644
--- a/mm/memdelay.c
+++ b/mm/memdelay.c
@@ -113,12 +113,11 @@ static void domain_cpu_update(struct memdelay_domain *md, int cpu,
 	 * one running the workload, the domain is considered fully
 	 * blocked 50% of the time.
 	 */
-	if (mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_IOWAIT])
+	if (mdc->tasks[MTS_DELAYED_ACTIVE])
 		state = MDS_FULL;
 	else if (mdc->tasks[MTS_DELAYED])
-		state = (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT]) ?
-			MDS_PART : MDS_FULL;
-	else if (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT])
+		state = mdc->tasks[MTS_RUNNABLE] ? MDS_PART : MDS_FULL;
+	else if (mdc->tasks[MTS_RUNNABLE])
 		state = MDS_NONE;
 	else
 		state = MDS_IDLE;
-- 
2.14.2

