Re: CONFIG_HZ impact on system load

Hello,

After some experimentation, tracing and debugging, I’ve arrived at the
conclusion that this is only a problem of how the load is measured and
not a “real” system load. The explanation follows, with the conclusion
at the bottom (if you want to skip ahead).

First, some context about how the load is computed (most of which I
discovered during this investigation):

The “tick” is basically a periodic function called CONFIG_HZ times per
second on every CPU. Tasks have two states that are counted in the load:
running and uninterruptible (the latter when the task is waiting for
something). Every ~5 s, on a tick, the number of tasks in these two
states is summed to get the current load, and this value is fed into a
filter to produce the 1, 5 and 15 minute averages (as seen in
/proc/loadavg). See the excellent blog post by Brendan Gregg for more
information [1].
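
To make the filtering step concrete, here is a minimal user-space sketch
of it: a simplified, floating-point version of the fixed-point
arithmetic in kernel/sched/loadavg.c (not the actual kernel code):

/*
 * Feed the sampled "active" task count into three exponentially
 * damped averages, once per ~5 s sampling period.
 */
#include <math.h>
#include <stdio.h>

static double calc_load(double load, double decay, double active)
{
	return load * decay + active * (1.0 - decay);
}

int main(void)
{
	/* decay factors for a 5 s sampling period */
	const double exp_1  = exp(-5.0 / 60.0);		/* 1 min  */
	const double exp_5  = exp(-5.0 / 300.0);	/* 5 min  */
	const double exp_15 = exp(-5.0 / 900.0);	/* 15 min */
	double avg1 = 0.0, avg5 = 0.0, avg15 = 0.0;

	/* feed a constant instantaneous load of 1 for 15 minutes */
	for (int i = 1; i <= 180; i++) {
		avg1  = calc_load(avg1,  exp_1,  1.0);
		avg5  = calc_load(avg5,  exp_5,  1.0);
		avg15 = calc_load(avg15, exp_15, 1.0);
		if (i % 12 == 0)	/* print once a minute */
			printf("%2d min: %.2f %.2f %.2f\n",
			       i / 12, avg1, avg5, avg15);
	}
	return 0;
}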

The above paragraph ignores the NoHZ feature, which allows ticks to be
skipped when they are not needed (for example when the CPU is idle or
running only one task). Enabling/disabling the NoHZ feature did not
change the problem, so it was disabled (nohz=off on the command line) to
make the analysis easier.

First, I used “trace-cmd analyze” (a not-yet-merged patch from Steven
Rostedt [2], rebased on master here [3]). This uses ftrace to compute
statistics about tasks. In particular, it computes the time each task
spends in the running and uninterruptible states (measured on scheduler
switches). The total time tasks spend in these states does not change
with HZ, so there is no additional load when HZ decreases. Moreover, the
theoretical load computed over a long period is what you would expect:
slightly above 1 for a stress process plus the rest of the system, even
when the load average shows values above 2.

Using ftrace, we can dump the load sample every 5s (before filtering).
This shows a base value of 1-2 but many spikes around 10. These spikes
are the reason loadavg shows a load significantly higher than 1.

To look for the sources of these spikes, we can get the instantaneous
load using the scheduler debugfs directory: $debugfs/sched/debug allows
us to compute the number of tasks in the running and uninterruptible
states (note: the counter of tasks in the uninterruptible state is
distributed across all CPUs; see kernel/sched/loadavg.c). By sampling
this value at 10 Hz, we get no spikes and an average load of ~1.
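
For reference, that sampling can be done with something like the sketch
below. It assumes debugfs is mounted at /sys/kernel/debug and that the
per-CPU “.nr_running”/“.nr_uninterruptible” fields appear in the cpu#N
sections of the debug file; the exact layout varies between kernel
versions:

/*
 * Sample the instantaneous load at ~10 Hz by summing the per-CPU
 * .nr_running and .nr_uninterruptible counters from the scheduler
 * debug file.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		FILE *f = fopen("/sys/kernel/debug/sched/debug", "r");
		char line[256];
		long nr_running = 0, nr_uninterruptible = 0, v;
		int in_cpu_section = 0;

		if (!f) {
			perror("fopen");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			/* only use the rq-level counters in the cpu#N
			 * sections, not the per-cfs_rq/rt_rq/dl_rq copies */
			if (strncmp(line, "cpu#", 4) == 0)
				in_cpu_section = 1;
			else if (strncmp(line, "cfs_rq[", 7) == 0 ||
				 strncmp(line, "rt_rq[", 6) == 0 ||
				 strncmp(line, "dl_rq[", 6) == 0)
				in_cpu_section = 0;
			if (!in_cpu_section)
				continue;
			if (sscanf(line, " .nr_running : %ld", &v) == 1)
				nr_running += v;
			else if (sscanf(line, " .nr_uninterruptible : %ld", &v) == 1)
				nr_uninterruptible += v;
		}
		fclose(f);
		printf("instantaneous load: %ld\n",
		       nr_running + nr_uninterruptible);
		usleep(100000);	/* ~10 Hz */
	}
}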

At this point, the hypothesis is that the spikes are too short to be
visible from userland. So, the next step is to precisely trace the
evolution of the instantaneous load. By “ftracing” the values composing
the load (nr_running & nr_uninterruptible for each CPU), we can
retroactively compute the load at each point in time. This shows that
the instantaneous load increases suddenly (spikes 15-20 high and ~20 µs
wide) at the same time as the load is sampled (AFAIK, the sampling does
not cause the load increase; the two are just simultaneous). And by
taking stack traces on load increases inside the spikes, we see that
some tasks are activated during the tick (for example, RCU-related
periodic tasks are activated on ticks).
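
The retroactive computation itself is straightforward; here is a sketch
of it (the input format is hypothetical, in reality the per-CPU counters
come out of the ftrace report):

/*
 * Keep the latest traced value of each CPU's nr_running and
 * nr_uninterruptible and emit the global instantaneous load after
 * every update. Input: one line per traced update,
 * "<timestamp> <cpu> <nr_running> <nr_uninterruptible>".
 */
#include <stdio.h>

#define MAX_CPUS 64

int main(void)
{
	long nr_running[MAX_CPUS] = { 0 };
	long nr_uninterruptible[MAX_CPUS] = { 0 };
	double ts;
	int cpu;
	long run, unint;

	while (scanf("%lf %d %ld %ld", &ts, &cpu, &run, &unint) == 4) {
		long load = 0;

		if (cpu < 0 || cpu >= MAX_CPUS)
			continue;
		nr_running[cpu] = run;
		nr_uninterruptible[cpu] = unint;

		for (int i = 0; i < MAX_CPUS; i++)
			load += nr_running[i] + nr_uninterruptible[i];

		printf("%.6f load=%ld\n", ts, load);
	}
	return 0;
}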

At last, the conclusion: the load is sampled at the end of the tick
function. During the tick, some tasks may be activated and increase the
instantaneous load. This creates a spike in load which is then sampled
and averaged, biasing the load measurement and displaying a higher load
than there is in reality.

The influence of CONFIG_HZ on this bias is this: a lower HZ increases
the probability that tasks are activated exactly on the tick that
samples the load (with fewer ticks per second, tick-driven activations
are concentrated on fewer ticks, so the sampling tick is more likely to
coincide with them).

All of this was observed on a CONFIG_PREEMPT_RT=y kernel. This leaves a
question: as per the original post, the anomalous load does not happen
on a !PREEMPT_RT kernel. Why? My (untested) guess is that the RT
preemption model allows the awakened threads to increase the load during
the tick, whereas with the non-RT preemption model the load increase
happens after the tick.

Finally, some more general questions: Is this load increase when
switching between !RT/RT or changing HZ generally seen? Is this a bug in
loadavg?

Links:
[1]: https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
[2]: https://lwn.net/ml/linux-rt-users/20220324025726.1727204-1-rostedt@xxxxxxxxxxx/
[3]: https://github.com/ycongal-smile/trace-cmd/tree/trace-cmd-analyze

--
Yoann Congal
Smile ECS - Expert technique



