Futex hash_bucket lock can break isolation and cause priority inversion on RT

Hello,

A report concerning latency sensitive applications using futexes on a
PREEMPT_RT kernel brought me to (try to!) refresh my understanding of
how futexes are implemented. The following is an attempt to make sense
of what I am seeing in traces, validate that it is indeed plausible,
and possibly collect ideas on how to address the issue at hand.

Simplifying what is actually a quite complicated setup composed of
non-realtime (i.e., background load mostly related to a container
orchestrator) and realtime tasks, we can consider the following
situation:

 - Multiprocessor system running a PREEMPT_RT kernel
 - Housekeeping CPUs (usually 2) running background tasks + “isolated”
   CPUs running latency sensitive tasks (which may also need to run
   non-realtime activities at times)
 - CPUs are isolated dynamically by using nohz_full/rcu_nocbs options
   and affinity; no static scheduler isolation is used (i.e., no
   isolcpus=domain); see the sketch after this list
 - Threaded IRQs, RCU related kthreads, timers, etc. are configured with
   the highest priorities on the system (FIFO)
 - Latency sensitive application threads run at a FIFO priority below
   the set of tasks from the previous point
 - Latency sensitive application uses futexes, but they protect data
   only shared among tasks running on the isolated set of CPUs
 - Tasks running on housekeeping CPUs also use futexes
 - Futexes belonging to the above two sets of non-interacting tasks
   are distinct
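
To make the affinity and priority side of the above setup concrete,
here is a minimal user-space sketch of how one of the latency
sensitive threads could be pinned and scheduled. The CPU number and
the FIFO value (kept below the irq/RCU threads, assumed at the
default 50) are made up for illustration:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <stdio.h>

/* Pin the calling thread to a (hypothetical) isolated CPU and give it
 * a FIFO priority below the irq/RCU kthreads (assumed at FIFO 50). */
static void setup_latency_sensitive_thread(void)
{
	cpu_set_t cpus;
	struct sched_param sp = { .sched_priority = 40 };	/* below 50 */
	int ret;

	CPU_ZERO(&cpus);
	CPU_SET(3, &cpus);		/* hypothetical isolated CPU */

	ret = pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
	if (ret)
		fprintf(stderr, "setaffinity: %s\n", strerror(ret));

	ret = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
	if (ret)
		fprintf(stderr, "setschedparam: %s\n", strerror(ret));
}

int main(void)
{
	setup_latency_sensitive_thread();
	/* ... latency sensitive work using process-private futexes ... */
	return 0;
}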

Under these conditions the actual issue presents itself when:

 - A background task on a housekeeping CPU enters the sys_futex syscall
   and locks an hb->lock (a PI-enabled mutex on RT)
 - That background task gets preempted by a higher priority task (e.g.
   NIC irq thread)
 - A low latency application task on an isolated CPU also enters
   sys_futex; due to a hash collision its futex maps to the same hash
   bucket as the background task's, so it tries to grab the same
   hb->lock and, even though it boosts the background task, it still
   needs to wait for the higher priority task (the NIC irq thread) to
   finish executing on the housekeeping CPU, eventually missing its
   deadline (see the toy model after this list)
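
The cross-talk comes from the fact that all futexes hash into a
single, global table of buckets, each serialized by its hb->lock. The
toy user-space model below is not the kernel's hashing (the real
futex_hash() in kernel/futex/core.c is jhash based) and the table
size and addresses are invented; it only illustrates that futex words
of completely unrelated tasks can end up in the same bucket:

#include <stdio.h>
#include <stdint.h>

#define NR_BUCKETS 256		/* invented table size */

/* Toy stand-in for the kernel's futex_hash(); just a modulo so the
 * example stays readable. */
static unsigned int bucket_of(uintptr_t uaddr)
{
	return (unsigned int)(uaddr >> 2) % NR_BUCKETS;
}

int main(void)
{
	/* Hypothetical futex word addresses of two unrelated tasks,
	 * picked 1024 bytes apart so the toy hash collides on purpose */
	uintptr_t rt_futex = 0x7f0000001000;
	uintptr_t bg_futex = 0x7f0000001400;

	printf("rt futex -> bucket %u\n", bucket_of(rt_futex));
	printf("bg futex -> bucket %u\n", bucket_of(bg_futex));

	if (bucket_of(rt_futex) == bucket_of(bg_futex))
		printf("collision: both serialize on the same hb->lock\n");

	return 0;
}

The collision matters on RT because hb->lock there is a sleeping,
PI-aware lock whose owner can itself be preempted, which is exactly
the chain described above.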

Now, of course by making the latency sensitive application tasks use a
higher priority than anything on housekeeping CPUs we could avoid the
issue, but the fact that an implicit in-kernel link between otherwise
unrelated tasks might cause priority inversion is probably not ideal?
Thus this email.
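
For completeness, the workaround mentioned above would amount to
something like the sketch below: bump the application thread above
the housekeeping irq/RCU threads (assumed here at the usual FIFO 50,
the chosen value is hypothetical), so that a PI-boosted hb->lock
owner is no longer preemptible by them. It works, but it forces the
application to outrank housekeeping activity it has no logical
relationship with:

#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <stdio.h>

/* Workaround sketch: run above the (assumed) FIFO 50 irq threads so a
 * boosted hb->lock owner on a housekeeping CPU runs immediately. */
static void outrank_housekeeping(void)
{
	struct sched_param sp = { .sched_priority = 60 };	/* hypothetical */
	int ret = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);

	if (ret)
		fprintf(stderr, "setschedparam: %s\n", strerror(ret));
}

int main(void)
{
	outrank_housekeeping();
	/* ... latency sensitive work ... */
	return 0;
}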

Does this report make any sense? If it does, has this issue ever been
reported and possibly discussed? I guess it’s kind of a corner case, but
I wonder if anybody has suggestions already on how to possibly try to
tackle it from a kernel perspective.

Thanks!
Juri




