On Wed, Jun 23, 2021 at 19:28, Michal Koutný <mkoutny@xxxxxxxx> wrote:
>
> Hello Ronny.
>
> On Mon, Jun 14, 2021 at 05:29:35PM +0200, Ronny Meeus <ronny.meeus@xxxxxxxxx> wrote:
> > All apps are running in the realtime domain and I'm using kernel 4.9
> > and cgroup v1. [...] when it enters a full load condition [...]
> > I start to gradually reduce the budget of the cgroup until the system
> > is idle enough.
>
> Does your application have RT requirements, or is there another reason
> why you use group RT allocations? (When your app seems to require all
> CPU time, you decide to curb it. And it still fulfills RT requirements?)

The application does not have strict RT requirements. The main reason
for using cgroups is to reduce the load of the high-consumer
applications when the system is under high load, so that lower-priority
apps can also get a share of the CPU.

We initially worked with fixed cgroups, but this has the big
disadvantage that unused budget configured in one group cannot be used
by another group, so that processing power is basically lost.

> > But sometimes, immediately after the process assignment, it stops for
> > a short period (something like 1 or 2s) and then starts to consume 40%
> > again.
>
> What if you reduce cpu.rt_period_us (and cpu.rt_runtime_us
> proportionally)? (Are the pauses shorter?) Is there any useful info in
> /proc/$PID/stack during these periods?

I tried shorter periods, like 100ms instead of 1s, but the problem is
still observed. An algorithm that scales both values proportionally is
more complex to implement, and I don't think it would solve the issue
either.

About the stack: it is difficult to tell from the software when the
issue happens, so dumping the stack at the right moment is not easy,
but it is a good idea and I will certainly consider it.

To observe the system I use a Spirent traffic generator, which shows
me the number of processed packets in a graph.
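For what it's worth, Michal's suggestion amounts to shrinking cpu.rt_period_us while keeping the runtime/period ratio (the effective CPU share) constant. A minimal shell sketch of that, assuming a cgroup v1 cpu controller mounted at the usual location and a hypothetical group name "mygroup" (adjust to your hierarchy):

```shell
#!/bin/sh
# Sketch: move a cgroup from a 1 s RT period to a 100 ms period while
# keeping the same 40% budget. CG is a hypothetical example path.
CG=/sys/fs/cgroup/cpu/mygroup

PERIOD_US=100000                           # 100 ms instead of 1 s
SHARE_PCT=40                               # keep the same 40% share
RUNTIME_US=$((PERIOD_US * SHARE_PCT / 100))

echo "rt_period_us=$PERIOD_US rt_runtime_us=$RUNTIME_US"

# Apply only if the cgroup actually exists. When shrinking, write the
# runtime first: the kernel rejects a period smaller than the current
# runtime (runtime must never exceed the period).
if [ -d "$CG" ]; then
    echo "$RUNTIME_US" > "$CG/cpu.rt_runtime_us"
    echo "$PERIOD_US"  > "$CG/cpu.rt_period_us"
fi
```

The write ordering matters in practice: going from period=1s/runtime=400ms down to period=100ms, writing the new period first would leave runtime (400ms) larger than the period (100ms), which the kernel refuses.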
This makes it easy to see the short peaks during which the system is
not returning any packets.

> > Is that expected behavior?
>
> Someone with RT group scheduling knowledge may tell :-)
>
> HTH,
> Michal