Re: [RFC PATCH v1 1/4] Introducing qpw_lock() and per-cpu queue & flush work

Waiman Long <longman@xxxxxxxxxx> · Wed, 4 Sep 2024 20:08:12 -0400

On 9/4/24 17:39, Waiman Long wrote:
On 6/21/24 23:58, Leonardo Bras wrote:
Some places in the kernel implement a parallel programming strategy
consisting of local_locks() for most of the work, while the rare remote
operations are scheduled on the target cpu. This keeps cache bouncing low,
since the cacheline tends to stay mostly local, and avoids the cost of locks
in non-RT kernels, even though the very few remote operations will be
expensive due to scheduling overhead.

On the other hand, for RT workloads this can represent a problem: getting
an important workload scheduled out to deal with some unrelated task is
sure to introduce unexpected deadline misses.

It's interesting, though, that local_lock()s in RT kernels become
spinlock()s. We can make use of those to avoid scheduling work on a remote
cpu by directly updating another cpu's per_cpu structure, while holding
its spinlock().

In order to do that, it's necessary to introduce a new set of functions to
make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
and also the corresponding queue_percpu_work_on() and flush_percpu_work()
helpers to run the remote work.

On non-RT kernels, no changes are expected, as every one of the introduced
helpers works exactly the same as the current implementation:
qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
queue_percpu_work_on()  ->  queue_work_on()
flush_percpu_work()     ->  flush_work()

For RT kernels, though, qpw_{un,}lock*() will use the extra cpu parameter
to select the correct per-cpu structure to work on, and acquire the
spinlock for that cpu.

queue_percpu_work_on() will just call the requested function on the current
cpu, which will operate on another cpu's per-cpu object. Since the
local_locks() become spinlock()s in PREEMPT_RT, we are safe doing that.

flush_percpu_work() then becomes a no-op since no work is actually
scheduled on a remote cpu.
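
The RT behaviour described above can be modelled in userspace roughly as
follows. This is only a sketch, not the code from the patch: pthread mutexes
stand in for the per-cpu spinlocks, and every model_* name, MODEL_NR_CPUS and
the "cached" field are hypothetical and exist only for illustration.

```c
/* Userspace model, NOT kernel code. All model_* names are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4

/* Per-"cpu" object: a lock plus the data it protects. */
struct model_pcp {
	pthread_mutex_t lock;	/* stands in for the per-cpu RT spinlock */
	long cached;		/* illustrative per-cpu state */
};

static struct model_pcp model_pcp[MODEL_NR_CPUS];

static void model_init(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		pthread_mutex_init(&model_pcp[cpu].lock, NULL);
		model_pcp[cpu].cached = 0;
	}
}

/* RT flavour: the cpu argument selects whose lock is taken. */
static void model_qpw_lock(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	pthread_mutex_lock(&model_pcp[cpu].lock);
}

static void model_qpw_unlock(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	pthread_mutex_unlock(&model_pcp[cpu].lock);
}

typedef void (*model_work_fn)(int cpu);

/* RT flavour: nothing is queued; fn runs on the caller right away. */
static void model_queue_percpu_work_on(int cpu, model_work_fn fn)
{
	fn(cpu);
}

/* RT flavour: no remote work exists, so there is nothing to wait for. */
static void model_flush_percpu_work(int cpu)
{
	(void)cpu;
}

/* Example "remote" operation: drain the target cpu's state under its lock. */
static void model_drain(int cpu)
{
	model_qpw_lock(cpu);
	model_pcp[cpu].cached = 0;
	model_qpw_unlock(cpu);
}
```

In this model, a caller on any cpu would call model_init() once, then
model_queue_percpu_work_on(2, model_drain) followed by
model_flush_percpu_work(2); the drain of cpu 2's object has already completed
when the first call returns, and the other cpus' objects are left alone.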

Some minimal code rework is needed in order to make this mechanism work:
the calls to local_{un,}lock*() in the functions that are currently
scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(), so in
RT kernels they can reference a different cpu. It's also necessary to use a
qpw_struct instead of a work_struct, but it just contains a work struct
and, in PREEMPT_RT, the target cpu.
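
As a rough illustration of that wrapper, here is a userspace sketch of a work
item that carries the target cpu only in the RT configuration. It is not the
definition from the patch: struct model_work, struct model_qpw, MODEL_RT and
both helper functions are hypothetical names chosen for this example.

```c
/* Userspace model, NOT kernel code. All model_* names are hypothetical. */
#include <assert.h>
#include <stddef.h>

#define MODEL_RT 1	/* set to 0 to model a non-RT build */

/* Stand-in for the kernel's work_struct. */
struct model_work {
	void (*func)(struct model_work *work);
};

/* Wrapper: the work item plus, only in the RT model, the target cpu. */
struct model_qpw {
	struct model_work work;
#if MODEL_RT
	int cpu;
#endif
};

/* Recover the wrapper from its embedded work item (container_of style). */
static struct model_qpw *model_to_qpw(struct model_work *work)
{
	assert(work != NULL);
	return (struct model_qpw *)((char *)work -
				    offsetof(struct model_qpw, work));
}

/* Which cpu should a work handler operate on? */
static int model_qpw_cpu(struct model_work *work, int this_cpu)
{
#if MODEL_RT
	(void)this_cpu;
	return model_to_qpw(work)->cpu;	/* target chosen by the caller */
#else
	(void)work;
	return this_cpu;		/* work already runs on the target */
#endif
}
```

The point of the conditional field is that the non-RT model pays nothing
extra: the wrapper is the same size as the bare work item, and the handler
simply uses the cpu it is running on.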

This should have almost no impact on non-RT kernels: a few this_cpu_ptr()
calls will become per_cpu_ptr(,smp_processor_id()).

On RT kernels, this should improve performance and reduce latency by
removing scheduling noise.

Signed-off-by: Leonardo Bras <leobras@xxxxxxxxxx>
---
  include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
  1 file changed, 88 insertions(+)
  create mode 100644 include/linux/qpw.h

diff --git a/include/linux/qpw.h b/include/linux/qpw.h
new file mode 100644
index 000000000000..ea2686a01e5e
--- /dev/null
+++ b/include/linux/qpw.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_QPW_H
+#define _LINUX_QPW_H

I would suggest adding a comment with a brief description of what 
qpw_lock/unlock() are for and their use cases. The "qpw" prefix itself 
isn't intuitive enough for a casual reader to understand what they are for.

Cheers,
Longman