[PATCH 00/26] Performance-related backports for 4.12.2

This is a second round of performance-related backports on top of 4.12.2,
picking low-hanging fruit from the 4.13 merge window.

As before, these have only been tested on 4.12-stable.  While they may
merge against older kernels, I have no data on how they behave there and
cannot guarantee it's a good idea, so I don't recommend it.  There would
also be some major conflicts that are not trivial to resolve.

For most of the tests I conducted the impact is marginal, but the first
two sets of patches are important for large machines and for users of
nohz_full. The load balancing patch is fairly specific but measurable.
The removal of unnecessary IRQ disabling/enabling is borderline in terms
of performance, but the patches are trivial and avoiding unnecessarily
expensive operations is always a plus.

Patches 1-17 resolve a number of topology problems in the scheduler that
	primarily impact NUMA machines with a ring topology. There are
	more patches in there than strictly necessary, but one adds very
	helpful comments explaining how the topology handling works and
	a few bring the naming of functions in line with 4.13, which
	makes the code easier to follow. Others shuffle comments around
	and restructure the code, which could have been avoided, but then
	the backported patches would not look like their upstream
	equivalents.  While some of the extra patches are outside the
	scope of -stable, taking them removes the delta between the
	4.12-stable and 4.13 schedulers; I can drop them if necessary.

	Performance impact on UMA and fully-connected machines is marginal,
	with minor gains/losses across multiple machines that are mostly
	within the noise, but other reports indicate that the impact on
	ring topologies is substantial. In particular, the full machine
	will be properly utilised instead of saturating a subset of nodes
	for workloads with lots of threads or processes, as the sketch
	below helps illustrate.
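
	Whether a given machine has the ring topology in question can be
	checked from userspace by dumping the SLIT node distances. Below
	is a minimal illustrative sketch (not part of the series) using
	libnuma; build with "gcc -O2 numa_dist.c -o numa_dist -lnuma".
	On a ring topology, some node pairs are multiple hops apart and
	show up with larger distances than on a fully-connected machine.

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int i, j, max;

	if (numa_available() < 0) {
		fprintf(stderr, "No NUMA support on this machine\n");
		return EXIT_FAILURE;
	}

	/* Print the SLIT distance matrix; the local distance is 10 */
	max = numa_max_node();
	for (i = 0; i <= max; i++) {
		for (j = 0; j <= max; j++)
			printf("%4d", numa_distance(i, j));
		printf("\n");
	}
	return EXIT_SUCCESS;
}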

Patches 18-22 are more about accounting than performance. The bug affects
	workloads running on nohz_full+isolcpus configurations. If two or
	more processes running on an isolated CPU are 100% userspace-bound
	while normal processes are running on other CPUs, then the isolated
	processes report a mix of userspace and system CPU usage.  It can
	be up to 100% system CPU usage even though in reality no time is
	being spent in the kernel. This misaccounting is confusing when
	analysing workloads.

	For normal workloads, there is no measurable difference.
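
	A minimal reproducer looks something like the sketch below; the
	boot parameters and the CPU number are assumptions about the test
	machine, so adjust to match. Boot with e.g. nohz_full=2-3
	isolcpus=2-3, run two instances of this pinned to the same
	isolated CPU and watch %usr versus %sys in top or pidstat.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	cpu_set_t mask;

	/* CPU 2 is assumed to be isolated via isolcpus=/nohz_full= */
	CPU_ZERO(&mask);
	CPU_SET(2, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask)) {
		perror("sched_setaffinity");
		return EXIT_FAILURE;
	}

	/* 100% userspace-bound; should be accounted as user time */
	for (;;)
		;
}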

Patch 23 fixes a scheduler load balancing issue where an imbalanced domain
	is considered balanced when some tasks are pinned for affinity.
	Again, for many workloads the impact is marginal, but it was a
	small boost (1-2%, barely outside the noise) for a specjbb
	configuration that pinned JVMs. It may be coincidence, but the
	patch is straightforward.

Patches 24-25 avoid unnecessary IRQ disable/enable while updating writeback
	stats. In many cases this will not be noticeable because it happens
	out-of-band and the cost of stats updates is often negligible
	compared to the overall cost of writeback. However, unnecessary
	IRQ disabling is never a good thing and it may be noticeable during
	writeback to ultra-fast storage.
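
	The pattern being removed is roughly the one below. This is a
	kernel-style sketch with a hypothetical per-CPU counter, not the
	actual writeback code: this_cpu_inc() is already safe with
	respect to interrupts, so the surrounding local_irq_save()/
	local_irq_restore() pair is pure overhead when the caller has no
	other reason to disable IRQs.

#include <linux/percpu.h>
#include <linux/irqflags.h>

static DEFINE_PER_CPU(unsigned long, example_stat);	/* hypothetical */

/* Before: explicitly disable IRQs around the update */
static void example_stat_inc_slow(void)
{
	unsigned long flags;

	local_irq_save(flags);
	this_cpu_inc(example_stat);
	local_irq_restore(flags);
}

/* After: rely on this_cpu_inc() being IRQ-safe on its own */
static void example_stat_inc(void)
{
	this_cpu_inc(example_stat);
}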

Patch 26 avoids an IRQ disable/enable in the fork path. It's noticeable
	on fork-intensive workloads, with for example a 1-3% boost on
	hackbench that is just outside the noise.
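
	Independent of hackbench, the effect can be approximated with a
	trivial fork microbenchmark along the lines of the sketch below
	(the iteration count is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct timespec start, end;
	int i, iterations = 10000;	/* arbitrary */

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			return EXIT_FAILURE;
		}
		if (pid == 0)
			_exit(0);	/* child exits immediately */
		waitpid(pid, NULL, 0);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("%d fork+exit pairs in %.3f seconds\n", iterations,
	       (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9);
	return 0;
}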

-- 
2.13.1



