+ psi-aggregate-ongoing-stall-events-when-somebody-reads-pressure.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Thu, 12 Jul 2018 16:55:29 -0700

The patch titled
     Subject: psi: aggregate ongoing stall events when somebody reads pressure
has been added to the -mm tree.  Its filename is
     psi-aggregate-ongoing-stall-events-when-somebody-reads-pressure.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/psi-aggregate-ongoing-stall-events-when-somebody-reads-pressure.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/psi-aggregate-ongoing-stall-events-when-somebody-reads-pressure.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days.

------------------------------------------------------
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Subject: psi: aggregate ongoing stall events when somebody reads pressure

Right now, psi reports pressure and stall times of already concluded stall
events.  For most use cases this is current enough, but certain highly
latency-sensitive applications, like the Android OOM killer, might want to
know about and react to stall states before they have even concluded (e.g.
a prolonged reclaim cycle).

This patches the procfs/cgroupfs interface such that when the pressure
metrics are read, the current per-cpu states, if any, are taken into
account as well.

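Any ongoing states are concluded, their time snapshotted, and then
restarted.  This requires holding the rq lock so that the snapshot does
not race with the scheduler's own updates to the per-cpu state.  Since
every read now takes the rq lock of each online CPU, this could use some
form of rq lock ratelimiting or avoidance.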

Requested-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Not-yet-signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Link: http://lkml.kernel.org/r/20180712172942.10094-11-hannes@xxxxxxxxxxx
Cc: Christopher Lameter <cl@xxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Johannes Weiner <jweiner@xxxxxx>
Cc: Mike Galbraith <efault@xxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Vinayak Menon <vinmenon@xxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---
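
The conclude/snapshot/restart step described in the changelog can be
sketched in plain userspace C.  This is only an illustration, not the
kernel code: struct stall_tracker, the tracker_*() helpers and the
atomic_flag spinlock (standing in for the rq lock) are all hypothetical
names, and timestamps are passed in by the caller instead of coming from
cpu_clock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for PSI_NONE / PSI_SOME / PSI_FULL. */
enum stall_state { STALL_NONE, STALL_SOME, STALL_FULL };

struct stall_tracker {
	atomic_flag lock;        /* stands in for the rq lock */
	enum stall_state state;  /* state that is currently running */
	uint64_t state_start;    /* when the current state began */
	uint64_t times[2];       /* [0] = some-only time, [1] = full time */
};

static void tracker_lock(struct stall_tracker *t)
{
	while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void tracker_unlock(struct stall_tracker *t)
{
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* Conclude the running state at 'now' and restart it. Caller holds the lock. */
static void fold_ongoing(struct stall_tracker *t, uint64_t now)
{
	if (t->state == STALL_NONE)
		return;
	assert(now >= t->state_start);
	t->times[t->state == STALL_FULL] += now - t->state_start;
	t->state_start = now;
}

static void tracker_init(struct stall_tracker *t)
{
	atomic_flag_clear(&t->lock);
	t->state = STALL_NONE;
	t->state_start = 0;
	t->times[0] = 0;
	t->times[1] = 0;
}

/* Writer side: a state change concludes the previous state first. */
static void tracker_set_state(struct stall_tracker *t,
			      enum stall_state new_state, uint64_t now)
{
	tracker_lock(t);
	fold_ongoing(t, now);
	t->state = new_state;
	t->state_start = now;
	tracker_unlock(t);
}

/* Reader side: include the still-running state, then flush the buckets. */
static void tracker_read(struct stall_tracker *t, uint64_t now,
			 uint64_t *some, uint64_t *full)
{
	tracker_lock(t);
	fold_ongoing(t, now);
	*some = t->times[0] + t->times[1];
	*full = t->times[1];
	t->times[0] = 0;
	t->times[1] = 0;
	tracker_unlock(t);
}
```

With this sketch, a reader that calls tracker_read() while a STALL_SOME
state is still running is charged the time accumulated so far, and the
next read only sees the time accumulated since then, because the state's
start timestamp was moved forward under the lock.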


diff -puN kernel/sched/psi.c~psi-aggregate-ongoing-stall-events-when-somebody-reads-pressure kernel/sched/psi.c

--- a/kernel/sched/psi.c~psi-aggregate-ongoing-stall-events-when-somebody-reads-pressure
+++ a/kernel/sched/psi.c
@@ -190,7 +190,7 @@ static void calc_avgs(unsigned long avg[
 	}
 }
 
-static bool psi_update_stats(struct psi_group *group)
+static bool psi_update_stats(struct psi_group *group, bool ondemand)
 {
 	u64 some[NR_PSI_RESOURCES] = { 0, };
 	u64 full[NR_PSI_RESOURCES] = { 0, };
@@ -200,8 +200,6 @@ static bool psi_update_stats(struct psi_
 	int cpu;
 	int r;
 
-	mutex_lock(&group->stat_lock);
-
 	/*
 	 * Collect the per-cpu time buckets and average them into a
 	 * single time sample that is normalized to wallclock time.
@@ -218,10 +216,36 @@ static bool psi_update_stats(struct psi_
 	for_each_online_cpu(cpu) {
 		struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
 		unsigned long nonidle;
+		struct rq_flags rf;
+		struct rq *rq;
+		u64 now;
 
-		if (!groupc->nonidle_time)
+		if (!groupc->nonidle_time && !groupc->nonidle)
 			continue;
 
+		/*
+		 * We come here for two things: 1) periodic per-cpu
+		 * bucket flushing and averaging and 2) when the user
+		 * wants to read a pressure file. For flushing and
+		 * averaging, which is relatively infrequent, we can
+		 * be lazy and tolerate some raciness with concurrent
+		 * updates to the per-cpu counters. However, if a user
+		 * polls the pressure state, we want to give them the
+		 * most up-to-date information we have, including any
+		 * currently active state which hasn't been timed yet,
+		 * because in case of an iowait or a reclaim run, that
+		 * can be significant.
+		 */
+		if (ondemand) {
+			rq = cpu_rq(cpu);
+			rq_lock_irq(rq, &rf);
+
+			now = cpu_clock(cpu);
+
+			groupc->nonidle_time += now - groupc->nonidle_start;
+			groupc->nonidle_start = now;
+		}
+
 		nonidle = nsecs_to_jiffies(groupc->nonidle_time);
 		groupc->nonidle_time = 0;
 		nonidle_total += nonidle;
@@ -229,13 +253,27 @@ static bool psi_update_stats(struct psi_
 		for (r = 0; r < NR_PSI_RESOURCES; r++) {
 			struct psi_resource *res = &groupc->res[r];
 
+			if (ondemand && res->state != PSI_NONE) {
+				bool is_full = res->state == PSI_FULL;
+
+				res->times[is_full] += now - res->state_start;
+				res->state_start = now;
+			}
+
 			some[r] += (res->times[0] + res->times[1]) * nonidle;
 			full[r] += res->times[1] * nonidle;
 
-			/* It's racy, but we can tolerate some error */
 			res->times[0] = 0;
 			res->times[1] = 0;
 		}
+
+		if (ondemand)
+			rq_unlock_irq(rq, &rf);
+	}
+
+	for (r = 0; r < NR_PSI_RESOURCES; r++) {
+		do_div(some[r], max(nonidle_total, 1UL));
+		do_div(full[r], max(nonidle_total, 1UL));
 	}
 
 	/*
@@ -249,12 +287,10 @@ static bool psi_update_stats(struct psi_
 	 * activity, thus no data, and clock ticks are sporadic. The
 	 * below handles both.
 	 */
+	mutex_lock(&group->stat_lock);
 
 	/* total= */
 	for (r = 0; r < NR_PSI_RESOURCES; r++) {
-		do_div(some[r], max(nonidle_total, 1UL));
-		do_div(full[r], max(nonidle_total, 1UL));
-
 		group->some[r] += some[r];
 		group->full[r] += full[r];
 	}
@@ -301,7 +337,7 @@ static void psi_clock(struct work_struct
 	 * go - see calc_avgs() and missed_periods.
 	 */
 
-	nonidle = psi_update_stats(group);
+	nonidle = psi_update_stats(group, false);
 
 	if (nonidle) {
 		unsigned long delay = 0;
@@ -570,7 +606,7 @@ int psi_show(struct seq_file *m, struct
 	if (psi_disabled)
 		return -EOPNOTSUPP;
 
-	psi_update_stats(group);
+	psi_update_stats(group, true);
 
 	for (w = 0; w < 3; w++) {
 		avg[0][w] = group->avg_some[res][w];
_

Patches currently in -mm which might be from hannes@xxxxxxxxxxx are

mm-workingset-tell-cache-transitions-from-workingset-thrashing.patch
delayacct-track-delays-from-thrashing-cache-pages.patch
sched-loadavg-consolidate-load_int-load_frac-calc_load.patch
sched-loadavg-make-calc_load_n-public.patch
sched-schedh-make-rq-locking-and-clock-functions-available-in-statsh.patch
sched-introduce-this_rq_lock_irq.patch
psi-pressure-stall-information-for-cpu-memory-and-io.patch
psi-cgroup-support.patch
psi-aggregate-ongoing-stall-events-when-somebody-reads-pressure.patch
