Ok, I have actually started an open source project that may make use of the oneshot interface. It is a bridging tool between two RDMA protocols called ib2roce. See https://gentwo.org/christoph/2022-bridging-rdma.pdf

The relevant code can be found at https://github.com/clameter/rdma-core/tree/ib2roce/ib2roce. In particular, look at the ib2roce.c source code. This is still under development.

The ib2roce bridging can run in a busy loop mode (-k option) where it spins on ibv_poll_cq(), an RDMA call that handles incoming packets without kernel interaction. See busyloop() in ib2roce.c.

Currently I have configured the system to use CONFIG_NOHZ_FULL. With that I am able to reliably forward packets at a rate that saturates 100G Ethernet / EDR InfiniBand from a single spinning thread. Without CONFIG_NOHZ_FULL, any slight disturbance causes the forwarding to fall behind, which leads to dramatic packet loss since we are looking at a potential data rate of 12.5 GByte/sec, i.e. about 12.5 MByte per msec. If the kernel interrupts the forwarding for, say, 10 msec, then we fall behind by 125 MB, which would have to be buffered and processed by additional code. That added complexity makes packet processing much slower, which could slow the forwarding down so far that recovery is no longer possible should the data continue to arrive at line rate.

Isolation of the threads was done through the following kernel parameters:

nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 poll_spectre_v2=off numa_balancing=disable rcutree.kthread_prio=3 intel_pstate=disable nosmt

And systemd was configured with the following affinities:

system.conf:CPUAffinity=0-7,16-23

This means that the second socket will generally be free of tasks and kernel threads.
The NUMA configuration:

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 94798 MB
node 0 free: 92000 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 96765 MB
node 1 free: 96082 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

I could modify busyloop() in ib2roce.c to use the oneshot mode via prctl() provided by this patch instead of NOHZ_FULL. What kind of metric could I use to show the difference in idleness, i.e. the quality of the CPU isolation?

The ib2roce tool already has a CLI mode where one can monitor the latencies that the busyloop experiences. See the latency calculations in busyloop() and the CLI command "core". Stats can be reset via the "zap" command.

I can see the usefulness of the oneshot mode, but (I am very very sorry) I still think that this patchset overdoes what is needed, and I fail to understand what the point of inheritance, per-syscall quiescing etc. is. Those cause needless overhead in syscall handling and increase the complexity of managing a busyloop.

Special handling when the scheduler switches a task? If tasks that require low latency and no disturbances are being switched, then something went very very wrong with the system configuration, and the only thing I would suggest is to issue a kernel warning that this is not the way one should configure the system.