On 11/29/2013 01:56 PM, Sebastian Andrzej Siewior wrote:
> * Clark Williams | 2013-11-26 10:12:32 [-0600]:
>> In my experience (on x86_64 mainly), that behavior (worse times when
>> not under load) is due to the overhead of coming out of power-save/idle
>> states. When you've got a big load on the system and all the cores are
>> active, then the power-save logic and/or the idle logic doesn't kick in
>> and devices aren't being powered down.
> This is the case here, too. The overhead coming out of a deep power
> state plus the invalidated caches.
Sorry, I feel that the discussion is somewhat out of sync with the
original posting. Let me explain.
Among other mechanisms, processors may use two completely different
interfaces to save power:
1. Sleep states aka C states, Linux interface cpuidle
2. Clock frequency modulation aka P states, Linux interface cpufreq
1. Sleep states
Processors may come with a number of C states, from light sleep to deep
sleep, to save power when idle. The longer a processor is idle, the
deeper the sleep state it may normally enter. Sleep states may be
disabled i) on a per-processor and per-state basis in
/sys/devices/system/cpu/cpuX/cpuidle/stateX/disable or ii) altogether
using the somewhat mislabeled /dev/cpu_dma_latency pseudo device. As far
as cyclictest is concerned, sleep states are normally disabled
altogether. If this is the case, cyclictest prints the message:
# /dev/cpu_dma_latency set to 0us
The original posting contains this line. Consequently, sleep states
cannot be responsible for any observed latency prolongation. To check
whether sleep states are indeed disabled, the command
# cat /sys/devices/system/cpu/cpu0/cpuidle/state?/time
may be repeated for every CPU. If sleep states are disabled correctly,
only the value of the first state (the poll state) may increase, e.g.
# cat /sys/devices/system/cpu/cpu0/cpuidle/state?/time
444330737734
234393550
1760323375
1234658099
183251179053
and sometime later
# cat /sys/devices/system/cpu/cpu0/cpuidle/state?/time
444417947595
234393550
1760323375
1234658099
183251179053
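To repeat this check conveniently across all CPUs, a small shell loop
may help (a minimal sketch; state0 is the poll state, and only its
residency, given in microseconds, should grow between two runs):
for c in /sys/devices/system/cpu/cpu[0-9]*
do
    echo -n "${c##*/}: "
    echo $(cat $c/cpuidle/state?/time)
done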
BTW: The cyclictest source contains a related comment:
/* Latency trick
* if the file /dev/cpu_dma_latency exists,
* open it and write a zero into it. This will tell
* the power management system not to transition to
* a high cstate (in fact, the system acts like idle=poll)
* When the fd to /dev/cpu_dma_latency is closed, the behavior
* goes back to the system default.
*
* Documentation/power/pm_qos_interface.txt
*/
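BTW: The same trick can be applied from a shell, provided the file
descriptor is kept open for the duration of the measurement (a minimal
sketch; for values other than 0, either check how the kernel parses
ASCII input to this device or write a binary 32-bit value as cyclictest
does):
exec 3> /dev/cpu_dma_latency   # keep fd 3 open during the measurement
echo -n 0 >&3                  # request 0 us wakeup latency (poll)
# ... run cyclictest or the real-time application here ...
exec 3>&-                      # closing the fd restores the default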
2. Clock frequency modulation
This is an entirely different story, as cyclictest has no business with
it at all. On x86 processors, latency scales more or less inversely
with the clock frequency, e.g. a system running at 1 GHz will show a
latency about twice as high as when running at 2 GHz. ARM processors, however,
behave differently. Many ARM cores do not provide acceptable latency
values unless running at full speed. It is, therefore, often mandatory
to switch to the performance CPU frequency governor before starting
cyclictest or before running a real-world user space application that
relies on minimum latency. The /sys/devices/system/cpu/cpuX/cpufreq
interface is available to manage P states:
Switch to maximum performance:
cd /sys/devices/system/cpu/
for i in cpu[0-9]*/cpufreq/scaling_governor
do
    echo performance >$i
done
Switch to on-demand frequency modulation:
for i in cpu[0-9]*/cpufreq/scaling_governor
do
    echo ondemand >$i
done
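Whether the switch took effect can be verified by reading back the
governor and the current frequency (still from within
/sys/devices/system/cpu/):
cat cpu[0-9]*/cpufreq/scaling_governor
cat cpu[0-9]*/cpufreq/scaling_cur_freq   # in kHz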
BTW: Power saving and real-time do not necessarily exclude each other.
If a somewhat longer, but still deterministic, latency is acceptable,
some light sleep states and a slightly lower clock frequency may be
allowed, which may still result in considerable energy savings. If,
however, the fastest possible real-time response is required, C states
and P states must be disabled (or set to polling and maximum speed,
respectively), and the power bill must be paid.
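As an illustration of such a compromise (a sketch only; state numbers
and frequency values differ from system to system, see
scaling_available_frequencies): keep the lightest sleep state, disable
the deeper ones, and cap the clock somewhat below maximum:
cd /sys/devices/system/cpu
for f in cpu[0-9]*/cpuidle/state[2-9]/disable
do
    echo 1 >$f       # disable deeper C states, keep state0 and state1
done
for f in cpu[0-9]*/cpufreq/scaling_max_freq
do
    echo 800000 >$f  # cap at 800 MHz (value in kHz, example only)
done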
> So the test now finally has better results on an idle system than on
> one with heavy system load. The numbers are still far away from your
> latency values on the 1.2GHz Kirkwood. Does anybody have OMAP3
> values at hand to compare?
This is why we run the OSADL QA Farm. An AM3359 system is in rack 7,
slot 5 -> https://www.osadl.org/?id=1590. We run 100 million cycles at a
200 µs cycle interval (which takes about 5 hours and 33 minutes) to
obtain reliable data. In addition, latency is recorded not only while
the processor is idle but also while it executes defined load scenarios. Please do
the same before you compare the results. To facilitate the comparison,
the cyclictest command line is given below every plot, and any other
relevant information (including kernel command line) is available in the
systems' profiles.
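For reference, the command line used there is of this general form (the
exact invocation is the one shown below the plot):
cyclictest -l100000000 -m -Sp99 -i200 -h400 -q
i.e. 100 million loops at a 200 µs interval, memory locked, one thread
per CPU at priority 99, with histogram recording and quiet operation.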
Hope this helps,
-Carsten.