Re: About rtla osnoise and timerlat usage

Prasad Pandit <ppandit@xxxxxxxxxx> · Mon, 27 Feb 2023 12:40:41 +0530

Hello Daniel, Steve,

On Thu, 23 Feb 2023 at 20:24, Daniel Bristot de Oliveira
<bristot@xxxxxxxxxx> wrote:
> On 2/23/23 11:39, Steven Rostedt wrote:
>>> kworker/[120] <--- this 120 is likely not the same as
>>> ktimer/[97] <---- this 97
>>>
>>> The kworker is likely a SCHED_OTHER 0 nice, and ktimer a FIFO:97.
>>> You are placing your load in between them.

* Oh right, even those threads have different priorities.
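
A quick way to confirm this on a running system is to query the policy
and priority of the threads in question. Below is a minimal sketch,
assuming Python 3 on Linux; describe_sched is a hypothetical helper name
for illustration and is not part of rtla. It only wraps the standard
sched_getscheduler/sched_getparam calls, and the PID to inspect (for
example a kworker or ktimer thread) has to be looked up separately.

```python
import os

# Map the policy constants exposed by the os module to readable names.
# getattr() is used because not every platform defines all of them.
POLICY_NAMES = {
    getattr(os, name): name
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR",
                 "SCHED_BATCH", "SCHED_IDLE")
    if hasattr(os, name)
}


def describe_sched(pid=0):
    """Return (policy_name, rt_priority) for a PID/TID; 0 means the caller."""
    policy = os.sched_getscheduler(pid)
    # The kernel may OR the reset-on-fork flag into the returned policy.
    policy &= ~getattr(os, "SCHED_RESET_ON_FORK", 0)
    prio = os.sched_getparam(pid).sched_priority
    return POLICY_NAMES.get(policy, str(policy)), prio


if __name__ == "__main__":
    name, prio = describe_sched(0)
    print(f"policy={name} rt_priority={prio}")
```

For a SCHED_OTHER thread the reported real-time priority is 0; the nice
value is a separate attribute and is not shown by this sketch.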

>>> That would not be bad if we ran a traditional periodic/sporadic real-time
>>> workload. That is, a task that waits for an event, wakes up, runs, and goes
>>> to sleep waiting for the next event.
>>>
>>> The problem is that oslat/osnoise run non-stop.
>>>
>>> Then a kworker awakened on the CPU will... starve. You will not see it
>>> causing a sched_switch, but if the kworker is pinned to that CPU, it will
>>> not make progress.
>>
>> Note, the kworker and other kernel threads that are pinned to a CPU are
>> ones that service requests that were triggered on that CPU. It is possible
>> to run a task at FIFO 99 on an isolated CPU non-stop without causing any
>> issue (you may also need to enable NO_HZ_FULL and make sure RCU has
>> no-callbacks enabled where the RCU for that isolated CPU gets its work done
>> on other CPUs).
>
> Yes, but in the perfect isolation case, where no other task is scheduled there, being
> FIFO, OTHER, or even IDLE is... equivalent, as no scheduler is needed :-).
>
>> If your FIFO task calls into the kernel and does something that triggers a
>> worker, then you may then have an issue. You will need to make sure that
>> worker gets time to run.
>>
>> The point I'm making is that it is possible to get something working where
>> you have a FIFO task running 100%, but you need to set up the system where
>> it will not cause issues. That requires knowing which system calls done
>> on that CPU may require workers.
>>
>> Oh, and there's another issue that can cause problems. Even if you figured
>> out everything your task does, and make sure that it doesn't trigger any
>> pinned kworkers, and you are using NO_CB_RCU and NO_HZ_FULL, there's still
>> an issue that needs to be taken care of. That is, if there was some task
>> running on that CPU just before your FIFO task runs, it could have
>> triggered a kworker. And even though it may be done, or even migrated to
>> another CPU, that kworker will still need to execute. I've seen this cause
>> days of debugging into why the system crashed.
>
> There are also cases where kworkers are dispatched to all CPUs, from a non-isolated CPU,
> to do some house-keeping work. E.g., I think that ftrace used to do that to allocate buffers.
> Ideally, all these cases should be reworked to avoid dispatching kworkers where they are
> not needed. But as kworkers are added to the code as part of the development, and bad
> 3rd-party drivers can also do it... and... who knows?
>
> In the exceptional case of something happening on that CPU, it is likely short-lived
> kernel work that is just easier to let run; one monitors those cases and tries
> to fix the code to avoid them.
>
> That is why the safest path is to assume that isolcpus is done to perfection:
> no scheduling will happen, and so all the schedulers are equivalent.
>

* I see, got it. Thank you so much for your kind replies and detailed
explanations, I appreciate it.
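
For reference, the isolation setup mentioned above (isolated CPUs,
NO_HZ_FULL, RCU no-callbacks) can be inspected roughly as follows. This
is only a sketch under stated assumptions: it reads the "isolated" and
"nohz_full" files under /sys/devices/system/cpu and looks for an
rcu_nocbs= argument in /proc/cmdline. The helper names parse_cpu_list,
read_cpu_list and isolation_report are hypothetical, and only plain
lists such as "2-5,8" are expanded; the rcu_nocbs value is returned as
the raw string.

```python
import re
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "2-5,8" into [2, 3, 4, 5, 8]."""
    cpus = []
    for part in text.strip().split(","):
        part = part.strip()
        if not part or not part[0].isdigit():
            continue  # empty file, or a placeholder such as "(null)"
        first, _, last = part.partition("-")
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus


def read_cpu_list(path):
    """Read and expand a CPU list file; a missing file yields []."""
    try:
        return parse_cpu_list(Path(path).read_text())
    except OSError:
        return []


def isolation_report(sysfs="/sys/devices/system/cpu", cmdline="/proc/cmdline"):
    """Collect the isolation-related settings visible from user space."""
    try:
        args = Path(cmdline).read_text()
    except OSError:
        args = ""
    match = re.search(r"\brcu_nocbs=(\S+)", args)
    return {
        "isolated": read_cpu_list(f"{sysfs}/isolated"),
        "nohz_full": read_cpu_list(f"{sysfs}/nohz_full"),
        "rcu_nocbs": match.group(1) if match else None,  # raw string
    }


if __name__ == "__main__":
    print(isolation_report())
```

An empty "isolated" or "nohz_full" list only says the CPU is not
configured that way; it says nothing about kworkers that were already
queued on it.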

Thank you.
---
  - P J P