Unbounded priority inversion while assigning tasks into cgroups.

Ronny Meeus <ronny.meeus@xxxxxxxxx> · Mon, 25 Oct 2021 11:43:52 +0200

Hello

An unbounded priority inversion is observed when moving tasks into cgroups.
In my case I'm using the cpu and cpuacct cgroups, but the issue does not
depend on which controllers are used.

Kernel version: 4.9.79
CPU: Dual core Cavium Octeon (MIPS)
Kernel configured with CONFIG_PREEMPT=y

I have a small application running at RT priority 92.
Its job is to move applications that consume a lot of CPU into a cgroup
when the system is under high load.
Under extreme load (a lot of script processing, i.e. process
clone / exec / exit, combined with high application load), the
application sometimes hangs for a long time: usually a couple of
seconds, but hangs of 2 minutes have also been observed.
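
For illustration, a minimal userspace sketch of the "move a task into a
cgroup" step: writing a PID to the cgroup's cgroup.procs file. This is
not the actual application code. The function name move_pid_to_cgroup
and any path passed to it (for example a cgroup.procs file under a
cgroup v1 cpu hierarchy) are assumptions made for the example.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: write one PID to a "cgroup.procs" file.
 * procs_path is supplied by the caller; nothing about the real
 * application's cgroup layout is assumed here.
 * Returns 0 on success or a negative errno value on failure.
 */
int move_pid_to_cgroup(const char *procs_path, pid_t pid)
{
	FILE *f = fopen(procs_path, "w");
	int rc = 0;

	if (f == NULL)
		return -errno;

	if (fprintf(f, "%ld\n", (long)pid) < 0)
		rc = -EIO;

	/*
	 * stdio buffers the data, so the write (and any error the
	 * kernel returns for the attach) only happens on flush/close.
	 */
	if (fclose(f) != 0 && rc == 0)
		rc = -errno;

	return rc;
}
```

On the kernel described here, it is a write like this one that ends up
in __cgroup_procs_write, so the caller blocks inside the write until the
lock discussed below has been acquired.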

Extending the kernel with traces (see below) showed that the root
cause of the blocking is the global rwsem "cgroup_threadgroup_rwsem".
While adding a task to a cgroup (__cgroup_procs_write), the write lock
is taken, and the writer has to wait until all other writers and
readers have completed their critical sections. This can take very
long, especially since many of them run at a much lower priority while
medium-priority applications keep the system under very high load.

As an initial attempt I tried applying the RT patch, but this did not
resolve the issue.

The second attempt was to replace cgroup_threadgroup_rwsem with an
rt_mutex (which offers priority inheritance).
After this change the issue seems to be resolved.
A disadvantage of this approach is that all accesses to the critical
section are serialized across all cores (writes to assign tasks to
cgroups and reads to create/exec/exit processes).
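
To illustrate what priority inheritance provides, here is a userspace
analogue using a POSIX mutex with the PTHREAD_PRIO_INHERIT protocol.
This is only a sketch of the concept, not the kernel change itself
(which uses the in-kernel rt_mutex); the helper name pi_mutex_init is
made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

/*
 * Hypothetical helper: initialise a mutex whose owner temporarily
 * inherits the priority of the highest-priority thread blocked on it.
 * Returns 0 on success or a positive errno value, like the pthread
 * calls it wraps.
 */
int pi_mutex_init(pthread_mutex_t *mutex)
{
	pthread_mutexattr_t attr;
	int err;

	err = pthread_mutexattr_init(&attr);
	if (err != 0)
		return err;

	/* Request priority inheritance instead of the default protocol. */
	err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	if (err == 0)
		err = pthread_mutex_init(mutex, &attr);

	pthread_mutexattr_destroy(&attr);
	return err;
}
```

With such a lock, a low-priority holder is boosted while a
high-priority thread waits, so medium-priority load cannot keep the
holder off the CPU. The cost is the same as noted above: a mutex has a
single owner, so readers no longer run concurrently.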

For the moment I do not see any other alternative to resolve this problem.
Any advice on the right way forward would be appreciated.

Best regards,
Ronny

Relevant part of the instrumented code in __cgroup_procs_write():

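trace_cgroup_lock(1000);	/* logged just before requesting the write lock */
percpu_down_write(&cgroup_threadgroup_rwsem);
trace_cgroup_lock(2000);	/* logged once the write lock has been acquired */
rcu_read_lock();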

A normal trace looks like:
resource_monito-18855 [001] ....  2685.097016: cgroup_lock: idx=2
resource_monito-18855 [001] ....  2685.097017: cgroup_lock: idx=1000
resource_monito-18855 [001] ....  2685.097018: cgroup_lock: idx=2000
resource_monito-18855 [001] ....  2685.097018: cgroup_lock: idx=101

A trace of a blocked application looks like:
resource_monito-18855 [001] ....  2689.736364: cgroup_lock: idx=2
resource_monito-18855 [001] ....  2689.736365: cgroup_lock: idx=1000
resource_monito-18855 [001] ....  2693.780339: cgroup_lock: idx=2000
resource_monito-18855 [001] ....  2693.780339: cgroup_lock: idx=101

In the problematic case above, the resource_monitor application was
blocked for about 4 s (from 2689.736365 to 2693.780339) between
idx=1000 and idx=2000, i.e. while waiting for the write lock on
cgroup_threadgroup_rwsem.
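
The wait time can be computed from any pair of idx=1000 / idx=2000
trace lines. Below is a small userspace sketch; the helper names
parse_cgroup_lock and lock_wait_usec are made up for the example, and
it assumes the line layout shown above (task name without spaces,
timestamp with a six-digit fraction).

```c
#include <stdio.h>

/*
 * Parse one "cgroup_lock" trace line as shown above.
 * Stores the timestamp in microseconds and the idx value.
 * Returns 0 on success, -1 if the line does not match.
 */
int parse_cgroup_lock(const char *line, long long *usec, int *idx)
{
	long long sec;
	int frac;

	if (sscanf(line, "%*s [%*d] %*s %lld.%6d: cgroup_lock: idx=%d",
		   &sec, &frac, idx) != 3)
		return -1;

	*usec = sec * 1000000LL + frac;
	return 0;
}

/*
 * Time in microseconds between the idx=1000 line (before the lock
 * request) and the idx=2000 line (lock acquired).
 * Returns -1 if either line is malformed or has an unexpected idx.
 */
long long lock_wait_usec(const char *before_line, const char *after_line)
{
	long long t_before, t_after;
	int idx_before, idx_after;

	if (parse_cgroup_lock(before_line, &t_before, &idx_before) != 0 ||
	    parse_cgroup_lock(after_line, &t_after, &idx_after) != 0)
		return -1;
	if (idx_before != 1000 || idx_after != 2000)
		return -1;

	return t_after - t_before;
}
```

Applied to the two traces above, this gives 1 microsecond for the
normal case and 4043974 microseconds (about 4.04 s) for the blocked
case.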