On 10/8/24 11:22 AM, Juri Lelli wrote:
Hello,

A report concerning latency sensitive applications using futexes on a PREEMPT_RT kernel brought me to (try to!) refresh my understanding of how futexes are implemented. The following is an attempt to make sense of what I am seeing from traces, validate that it indeed might make sense and possibly collect ideas on how to address the issue at hand.

Simplifying what is actually a quite complicated setup composed of non-realtime (i.e., background load mostly related to a containers orchestrator) and realtime tasks, we can consider the following situation:

- Multiprocessor system running a PREEMPT_RT kernel
- Housekeeping CPUs (usually 2) running background tasks + “isolated” CPUs running latency sensitive tasks (possibly need to run also non-realtime activities at times)
- CPUs are isolated dynamically by using nohz_full/rcu_nocbs options and affinity, no static scheduler isolation is used (i.e., no isolcpus=domain)
- Threaded IRQs, RCU related kthreads, timers, etc. are configured with the highest priorities on the system (FIFO)
- Latency sensitive application threads run at FIFO priority below the set of tasks from the former point
- Latency sensitive application uses futexes, but they protect data only shared among tasks running on the isolated set of CPUs
- Tasks running on housekeeping CPUs also use futexes
- Futexes belonging to the above two sets of non interacting tasks are distinct

Under these conditions the actual issue presents itself when:

- A background task on a housekeeping CPU enters sys_futex syscall and locks a hb->lock (PI enabled mutex on RT)
- That background task gets preempted by a higher priority task (e.g. NIC irq thread)
- A low latency application task on an isolated CPU also enters sys_futex, hash collision towards the background task hb, tries to grab hb->lock and, even if it boosts the background task, it still needs to wait for the higher priority task (NIC irq) to finish executing on the housekeeping CPU and eventually misses its deadline

Now, of course by making the latency sensitive application tasks use a higher priority than anything on housekeeping CPUs we could avoid the issue, but the fact that an implicit in-kernel link between otherwise unrelated tasks might cause priority inversion is probably not ideal? Thus this email.

Does this report make any sense? If it does, has this issue ever been reported and possibly discussed? I guess it’s kind of a corner case, but I wonder if anybody has suggestions already on how to possibly try to tackle it from a kernel perspective.
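
To make the collision step above concrete, here is a minimal user-space sketch of how two distinct futex addresses can end up behind the same bucket (and hence the same hb->lock). Everything in it is an assumption for illustration: the bucket count, the `toy_mix()` integer mixer, and all `toy_*` names are hypothetical and are not the kernel's hashing code. It only shows that with a shared table, any set of futexes larger than the number of buckets must contain at least one colliding pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUCKETS 16u /* hypothetical table size, must be a power of two */

/* Toy integer mixer; a stand-in only, NOT the kernel's hash function. */
static uint32_t toy_mix(uint64_t key)
{
    key ^= key >> 33;
    key *= 0xff51afd7ed558ccdULL;
    key ^= key >> 33;
    return (uint32_t)key;
}

/* Map a (hypothetical) futex address to a bucket index in the shared table. */
static unsigned toy_bucket(uint64_t futex_addr)
{
    assert((TOY_NBUCKETS & (TOY_NBUCKETS - 1)) == 0);
    return toy_mix(futex_addr) & (TOY_NBUCKETS - 1);
}

/*
 * Scan addrs[0..n) and report the first pair of entries that share a bucket.
 * Returns 1 and fills *first / *second on a collision, 0 otherwise.
 * With n > TOY_NBUCKETS a collision is guaranteed by the pigeonhole principle.
 */
static int toy_find_collision(const uint64_t *addrs, size_t n,
                              size_t *first, size_t *second)
{
    size_t owner[TOY_NBUCKETS];
    int used[TOY_NBUCKETS] = {0};

    for (size_t i = 0; i < n; i++) {
        unsigned b = toy_bucket(addrs[i]);

        if (used[b]) {
            *first = owner[b];
            *second = i;
            return 1;
        }
        used[b] = 1;
        owner[b] = i;
    }
    return 0;
}
```

In this model, if one of the colliding addresses belongs to a background task and the other to a latency sensitive task, both would serialize on the same bucket lock even though they share no data.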
Just a question. Is the low latency application using PI futexes or the normal wait-wake futexes? We could use a separate set of hash buckets for these distinct futex types.
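
A rough user-space sketch of what "a separate set of hash buckets per futex type" could look like, purely to illustrate the idea. The table size, the `demo_*` names, the `demo_bucket` layout and the mixer are all hypothetical; this is not existing kernel code nor a proposed patch.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NBUCKETS 16u /* hypothetical per-type table size, power of two */

/* The two futex flavours mentioned above, plus a count of them. */
enum demo_futex_type {
    DEMO_FUTEX_WAIT_WAKE = 0,
    DEMO_FUTEX_PI = 1,
    DEMO_FUTEX_NTYPES
};

struct demo_bucket {
    int lock_placeholder; /* stands in for hb->lock in this sketch */
    unsigned waiters;
};

/* One independent bucket array per futex type instead of one shared table. */
static struct demo_bucket demo_tables[DEMO_FUTEX_NTYPES][DEMO_NBUCKETS];

/* Toy integer mixer; a stand-in only, NOT the kernel's hash function. */
static uint32_t demo_mix(uint64_t key)
{
    key ^= key >> 33;
    key *= 0xff51afd7ed558ccdULL;
    key ^= key >> 33;
    return (uint32_t)key;
}

/* Pick the bucket for an address, using the table that matches its type. */
static struct demo_bucket *demo_select_bucket(enum demo_futex_type type,
                                              uint64_t uaddr)
{
    assert((unsigned)type < DEMO_FUTEX_NTYPES);
    return &demo_tables[type][demo_mix(uaddr) & (DEMO_NBUCKETS - 1)];
}
```

In this sketch a PI futex and a wait-wake futex can never share a bucket, whatever their addresses hash to. Two futexes of the same type can still collide, so the split only helps if the two groups of tasks use different futex types.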
Cheers,
Longman