Re: About rtla osnoise and timerlat usage

Steven Rostedt <rostedt@xxxxxxxxxxx> · Thu, 23 Feb 2023 09:39:00 -0500

On Thu, 23 Feb 2023 11:17:03 -0300
Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> wrote:

> I am not sure if I understood what you mean but...
> 
> kworker/[120] <--- this 120 is likely not the same as
> ktimer/[97] <---- this 97
> 
> The kworker is likely a SCHED_OTHER 0 nice, and ktimer a FIFO:97.
> 
> You are placing your load in between them.
> 
> That would not be bad if we ran a traditional periodic/sporadic real-time
> workload. That is, a task that waits for an event, wakes up, runs, and goes
> to sleep waiting for the next event.
> 
> The problem is that oslat/osnoise run non-stop.
> 
> Then a kworker awakened on the CPU will... starve. You will not see it
> causing a sched_switch, but if the kworker is pinned to that CPU, it will
> not make progress.

Note, the kworker and other kernel threads that are pinned to a CPU are
the ones that service requests that were triggered on that CPU. It is
possible to run a task at FIFO 99 on an isolated CPU non-stop without
causing any issue (you may also need to enable NO_HZ_FULL and RCU
no-callbacks mode, so that the RCU callbacks for that isolated CPU are
processed on other CPUs).
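
A minimal sketch of how that setup could be checked from user space. It
assumes the kernel exposes /sys/devices/system/cpu/isolated and
/sys/devices/system/cpu/nohz_full, and that no-callbacks mode was requested
with rcu_nocbs= on the kernel command line. The helper names are
hypothetical, and only plain CPU lists such as "2-5,8" are handled:

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "2-5,8" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def read_cpu_list(path):
    """Read a CPU list file; a missing or unreadable file counts as empty."""
    try:
        return parse_cpu_list(Path(path).read_text())
    except OSError:
        return set()


def cmdline_cpu_list(param, cmdline):
    """Return the CPUs given as param=<list> on a kernel command line."""
    for word in cmdline.split():
        key, sep, value = word.partition("=")
        if key == param and sep:
            try:
                return parse_cpu_list(value)
            except ValueError:
                # Other forms (e.g. "all") are not handled in this sketch.
                return set()
    return set()


def isolation_report(cpu):
    """Report which of the three settings cover the given CPU."""
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    return {
        "isolated": cpu in read_cpu_list("/sys/devices/system/cpu/isolated"),
        "nohz_full": cpu in read_cpu_list("/sys/devices/system/cpu/nohz_full"),
        "rcu_nocbs": cpu in cmdline_cpu_list("rcu_nocbs", cmdline),
    }
```

Calling isolation_report(3), for example, returns a dict of three booleans
for CPU 3. A False entry only means this sketch did not find the setting,
not that the kernel is necessarily misconfigured.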

If your FIFO task calls into the kernel and does something that triggers a
worker, then you may have an issue. You will need to make sure that the
worker gets time to run.

The point I'm making is that it is possible to get something working where
you have a FIFO task running 100%, but you need to set up the system so
that it will not cause issues. That requires knowing which system calls
made on that CPU may require workers.

Oh, and there's another issue that can cause problems. Even if you have
figured out everything your task does, made sure that it doesn't trigger
any pinned kworkers, and are using RCU_NOCB_CPU and NO_HZ_FULL, there's
still an issue that needs to be taken care of. If some other task was
running on that CPU just before your FIFO task started, it could have
triggered a kworker. Even though that task may have finished, or even
migrated to another CPU, the kworker will still need to execute. I've seen
this cause days of debugging to find out why the system crashed.
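
One rough way to look for such leftover work before starting the FIFO task
is sketched below. It assumes per-CPU kworkers show up in /proc/<pid>/stat
with names of the form kworker/<cpu>:<id> (unbound ones use kworker/u...),
and that a worker with work to do is in a state other than I (idle) or S
(sleeping). The helper names are hypothetical, and the scan is only a
snapshot, so it can miss work that is queued a moment later:

```python
import re
from pathlib import Path

# Per-CPU (bound) workers are named kworker/<cpu>:<id>; unbound workers
# are named kworker/u<pool>:<id> and do not match this pattern.
BOUND_KWORKER = re.compile(r"^kworker/(\d+):\d+")


def kworker_cpu(comm):
    """Return the CPU a bound kworker belongs to, or None for anything else."""
    m = BOUND_KWORKER.match(comm)
    return int(m.group(1)) if m else None


def parse_stat(line):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    start = line.index("(") + 1
    end = line.rindex(")")  # comm itself may contain parentheses
    return line[start:end], line[end + 2]


def busy_kworkers(cpu):
    """List (pid, comm, state) for bound kworkers on cpu that are not idle."""
    found = []
    for stat in Path("/proc").glob("[0-9]*/stat"):
        try:
            comm, state = parse_stat(stat.read_text())
        except (OSError, ValueError, IndexError):
            continue  # the task exited while we were scanning
        if kworker_cpu(comm) == cpu and state not in "IS":
            found.append((int(stat.parent.name), comm, state))
    return found
```

An empty list from busy_kworkers(3) suggests nothing is waiting on CPU 3 at
that instant. A kworker stuck in state R while the FIFO task is running is
the starvation case described above.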

-- Steve