Re: About rtla osnoise and timerlat usage

Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> · Wed, 22 Feb 2023 10:15:34 -0300

On 2/22/23 09:39, Prasad Pandit wrote:
> Hello Daniel,
> 
> Thank you so much for your reply, I appreciate it.
> 
> On Wed, 22 Feb 2023 at 17:30, Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> wrote:
> 
>     This is timerlat's timer, so it is expected. What this trace points to is
>     a possible exit-from-idle latency... so idle tuning is required for this system
>     and *this metric*... but
> 
> 
> * Idle tune?
>  
> 
>     Yes, that is expected with timerlat on an isolated CPU. But not with osnoise/oslat-style
>     tools, as they keep running, while timerlat/cyclictest go to sleep.
> 
> 
> * I see, okay.
> 
>     Let me know how rtla osnoise results are, so I can help more. 
> 
> 
> * Yes, I've been running oslat(1) and rtla-osnoise(1) too.
>    Please see:
>     oslat(1) log -> https://0bin.net/paste/T0PDXHz5#AnNEzkTRxQVT1gvAqKM43jW+yhqilbNbFqHIHHpy4MY
>     rtla-osnoise-top(1) log -> https://0bin.net/paste/8qwjebnZ#22sfTYTv68JAAMHZJhnCBTP-uvP7Mxj8ipAVbuQVsiy

The problem in the oslat case is that trace-cmd is being woken up on the
isolated CPU.

That is probably because trace-cmd ran there at some point and armed a timer.

I recommend you restrict the affinity of trace-cmd to the non-isolated CPUs before
starting it and run the experiment again.
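
A minimal sketch of that affinity restriction, assuming (hypothetically)
that CPUs 2-3 are the isolated ones; the helper names and the trace-cmd
arguments in the comment are illustrative, not taken from this thread.
The process sets its own affinity and then execs the tracer, since the
affinity mask is inherited across exec:

```python
import os


def parse_cpu_list(spec):
    """Parse a kernel-style CPU list such as "0-1,4" into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def housekeeping_cpus(all_cpus, isolated_spec):
    """Return the CPUs in all_cpus that are NOT in the isolated list."""
    return set(all_cpus) - parse_cpu_list(isolated_spec)


def run_pinned(argv, cpus):
    """Restrict this process to cpus, then replace it with argv (Linux only)."""
    if not cpus:
        raise ValueError("no housekeeping CPUs left to run on")
    os.sched_setaffinity(0, cpus)  # 0 means the calling process
    os.execvp(argv[0], argv)       # does not return on success


# Example use (assumption: isolated CPUs are "2-3"):
#   cpus = housekeeping_cpus(os.sched_getaffinity(0), "2-3")
#   run_pinned(["trace-cmd", "record", "-e", "sched"], cpus)
```

From a shell, the same restriction can be had by prefixing the trace-cmd
command line with taskset -c and the list of non-isolated CPUs.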

However, a busy loop at FIFO:95 is not a good setup, because it forces you to
raise the priority of other things, such as the ktimer. In your example,
ktimer runs as FIFO:97... it is hard to justify this as a sane setup.

On a properly isolated CPU, SCHED_OTHER should be enough. I understand that
people use FIFO because it gives the impression that the busy loop will
receive more CPU time, but this is biased by tools that measure only the
single worst latency occurrence, and not the overall latency.

See this article: https://research.redhat.com/blog/article/osnoise-for-fine-tuning-operating-system-noise-in-linux-kernel/

While running with FIFO reduces the "max single noise" by two us (from 7 to 5 us)
relative to SCHED_OTHER, the total amount of noise seen by the tool running with
FIFO is larger, because the starvation of tasks requires further checks from the
OS side, generating further noise. So SCHED_OTHER is better for total noise.
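
To make the two metrics concrete, here is a small sketch that computes
both from a list of per-interference noise durations. The sample values
are made up purely for illustration (only the 5 us and 7 us maxima echo
the figures above); they are not measurements from the article or from
this thread:

```python
def summarize(noise_us):
    """Return (max single noise, total noise) for a list of durations in us."""
    return max(noise_us), sum(noise_us)


# HYPOTHETICAL samples, invented for illustration only.
fifo_samples = [5, 4, 5, 3, 5, 4, 5, 4]   # lower peak, but more occurrences
other_samples = [7, 2, 1, 2, 1, 2]        # higher peak, fewer occurrences

fifo_max, fifo_total = summarize(fifo_samples)
other_max, other_total = summarize(other_samples)

print(f"FIFO:        max single = {fifo_max} us, total = {fifo_total} us")
print(f"SCHED_OTHER: max single = {other_max} us, total = {other_total} us")
```

A tool that reports only the first number would rank the FIFO run as
better, while the second number ranks it as worse.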

In properly isolated systems, the solution is to keep things off the isolated
CPUs, not to starve them. If the system has an unavoidable job that is pinned
to a CPU, just let it run. Keeping the system in a starving condition is
keeping the system in a faulty state, and the work needed to take the system
out of this situation (like using throttling or stalld) will only cause more
noise.

-- Daniel