Re: Observation on NOHZ_FULL

On Tue, Jan 30, 2024 at 12:06:49PM +0100, Uladzislau Rezki wrote:
> On Tue, Jan 30, 2024 at 02:17:22AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 30, 2024 at 07:58:18AM +0100, Andrea Righi wrote:
> > > Hi Joel and Paul,
> > > 
> > > comments below.
> > > 
> > > On Mon, Jan 29, 2024 at 05:16:38PM -0500, Joel Fernandes wrote:
> > > > Hi Paul,
> > > > 
> > > > On 1/29/2024 3:41 PM, Paul E. McKenney wrote:
> > > > > On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
> > > > >> Hi Guys,
> > > > >> Something caught my eye in [1], which a colleague pointed me to:
> > > > >>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
> > > > >>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
> > > > >>
> > > > >> The test in question is:
> > > > >> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
> > > > >>
> > > > >> which is a CPU intensive test.
> > > > >>
> > > > >> Any thoughts on what else could account for a 30% performance increase
> > > > >> versus non-nohz_full? (Confession: No idea if the baseline is
> > > > >> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
> > > > >> nohz_full on some of our limited-CPU devices :)
> > > > > 
> > > > > The usual questions.  ;-)
> > > > > 
> > > > > Is this repeatable?  Is it under the same conditions of temperature,
> > > > > load, and so on?  Was it running on bare metal or on a guest OS?  If on a
> > > > > guest OS, what was the load from other guest OSes on the same hypervisor
> > > > > or on the hypervisor itself?
> > > 
> > > That was the result of a quick test, so I expect it has some fuzziness
> > > in there.
> > > 
> > > It's an average of 10 runs on bare metal (my laptop, 8 cores, 11th
> > > Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), *but* I wanted to run the
> > > test with the default Ubuntu settings, which means "power mode:
> > > balanced" is enabled. I don't know exactly what that does (I'll check
> > > how it works in detail); IIRC it uses Intel P-states.
> > > 
> > > Also, the system was not completely isolated (my email client was
> > > running), but it was mostly idle in general.
> > > 
> > > I was already planning to repeat the tests in a more "isolated"
> > > environment and add details to the bug tracker.
> > > 
> > > > > 
> > > > > The bug report had "CONFIG_HZ=250 : 17415.60 bogo ops/s", which makes
> > > > > me wonder if someone enabled some heavy debug that is greatly
> > > > > increasing the overhead of the scheduling-clock interrupt.
> > > > > 
> > > > > Now, if that was the case, I would expect the 250HZ number to have
> > > > > three-quarters of the improvement of the nohz_full number compared
> > > > > to the 1000HZ number:
> > > > > 17415.60-14866.05=2549.55
> > > > > 18505.52-14866.05=3639.47
> > > > > 
> > > > > 2549.55/3639.47=0.70
> > > > 
> > > > I wonder if the difference here could possibly also be because of the
> > > > CPU idle governor. It may behave differently at different clock rates,
> > > > so perhaps it has different overhead.
> > > 
> > > Could be, but, again, the balanced power mode could play a major role
> > > here.
> > > 
> > > > 
> > > > I have added evaluating nohz_full to my list as well.  FWIW, when we
> > > > moved from 250HZ to 1000HZ, it actually improved power because the CPUidle
> > > > governor could put the CPUs in deeper idle states more quickly!
> > > 
> > > Interesting, another benefit to add to my proposal. :)
> > > 
> > > > 
> > > > > OK, 0.70 is not *that* far off of 0.75.  So what debugging does that
> > > > > test have enabled?  Also, if you use tracing (or whatever) to measure
> > > > > the typical duration of the scheduling-clock interrupt and related things
> > > > > like softirq handlers, does it fit with these numbers?  Such a measurement
> > > > > would look at how long it took to get back into userspace.
> > 
> > Just to emphasize...
> > 
> > The above calculations show that your measurements are close to what you
> > would expect if scheduling-clock interrupts took longer than one would
> > expect.  Here "scheduling-clock interrupts" includes softirq processing
> > (timers, networking, RCU, ...)  that piggybacks on each such interrupt.
> > 
> > Although softirq makes the most sense given the amount of time that must
> > be consumed, for the most part softirq work is conserved, which suggests
> > that you should also look at the rest of the system to check whether the
> > reported speedup is instead due to this work simply being moved to some
> > other CPU.
> > 
> > But maybe the fat softirqs are due to some debugging option that Ubuntu
> > enabled, in which case checking up on the actual duration (perhaps
> > using some form of tracing) would provide useful information.  ;-)
> > 
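FWIW, for measuring the actual tick duration, something along these lines
is what I had in mind (just a sketch, assuming an x86 box with bpftrace
available; on x86 the irq_vectors:local_timer_entry/exit tracepoints
bracket the scheduling-clock interrupt):

  bpftrace -e '
    tracepoint:irq_vectors:local_timer_entry { @start[cpu] = nsecs; }
    tracepoint:irq_vectors:local_timer_exit /@start[cpu]/ {
      /* histogram of per-tick duration, in microseconds */
      @tick_us = hist((nsecs - @start[cpu]) / 1000);
      delete(@start[cpu]);
    }'

Softirq time could be bracketed the same way with the irq:softirq_entry
and irq:softirq_exit tracepoints.
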
> As a first step I would have a look at what is going on during a test
> run. For that purpose the "perf" tool can be used. As a basic step it
> can be run in "top" mode:
> 
> perf top -a -g -e cycles:k 
> 
> Sorry for the noise :)
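
Right, and for offline analysis something like this could work too (again
just a sketch, wrapping the same stress-ng invocation from the bug report):

  perf record -a -g -e cycles:k -- \
      stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
  perf report --sort=comm,symbol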

Yep, I'm planning to do better tests and collect more info (perf,
bpftrace). I'll also make sure that we don't have some crazy debugging
config enabled in the Ubuntu kernel, as correctly pointed out by Paul. But
first of all I need to repeat the tests in a more isolated environment,
just to make sure we're looking at reasonable numbers here.
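
For the record, a quick sanity check of the Ubuntu kernel config for the
usual heavy debug options would be something like this (just a sketch, the
list of options is not exhaustive):

  grep -E 'CONFIG_(PROVE_LOCKING|LOCKDEP|DEBUG_PREEMPT|DEBUG_OBJECTS|KASAN)=y' \
      /boot/config-$(uname -r)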

Thanks,
-Andrea



