Re: [RESEND PATCH v2] eventfd: introduce ratelimited wakeup for non-semaphore eventfd

Wen Yang <wen.yang@xxxxxxxxx> · Thu, 15 Aug 2024 22:53:16 +0800

On 2024/8/15 00:50, Jens Axboe wrote:
On 8/14/24 10:15 AM, Wen Yang wrote:

On 2024/8/11 18:26, Mateusz Guzik wrote:
On Sun, Aug 11, 2024 at 04:59:54PM +0800, Wen Yang wrote:
For the NON-SEMAPHORE eventfd, a write(2) call adds the 8-byte integer
value provided in its buffer to the counter, while a read(2) returns the
8-byte counter value and resets the counter to 0.
Therefore, the accumulated value of multiple writes can be retrieved by a
single read.

However, the current code wakes up the reader thread immediately after
every write to a NON-SEMAPHORE eventfd, which adds unnecessary CPU
overhead. Introducing a configurable rate-limiting mechanism in
eventfd_write reduces these unnecessary wake-ups.

[snip]

     # ./a.out  -p 2 -s 3
     The original cpu usage is as follows:
09:53:38 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
09:53:40 PM    2   47.26    0.00   52.74    0.00    0.00    0.00    0.00    0.00    0.00    0.00
09:53:40 PM    3   44.72    0.00   55.28    0.00    0.00    0.00    0.00    0.00    0.00    0.00

09:53:40 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
09:53:42 PM    2   45.73    0.00   54.27    0.00    0.00    0.00    0.00    0.00    0.00    0.00
09:53:42 PM    3   46.00    0.00   54.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

09:53:42 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
09:53:44 PM    2   48.00    0.00   52.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
09:53:44 PM    3   45.50    0.00   54.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00

Then enable the ratelimited wakeup, e.g.:
     # ./a.out  -p 2 -s 3  -r1000 -c2

We observe a decrease of over 20% in CPU utilization (CPU 3, 54% -> 30%), as shown below:
10:02:32 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:02:34 PM    2   53.00    0.00   47.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:02:34 PM    3   30.81    0.00   30.81    0.00    0.00    0.00    0.00    0.00    0.00   38.38

10:02:34 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:02:36 PM    2   48.50    0.00   51.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:02:36 PM    3   30.20    0.00   30.69    0.00    0.00    0.00    0.00    0.00    0.00   39.11

10:02:36 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:02:38 PM    2   45.00    0.00   55.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:02:38 PM    3   27.08    0.00   30.21    0.00    0.00    0.00    0.00    0.00    0.00   42.71

Where are these stats from? Is this from your actual program you coded
the feature for?

The program you inlined here does next to nothing in userspace and
unsurprisingly the entire thing is dominated by kernel time, regardless
of what event rate can be achieved.

For example I got: ./a.out -p 2 -s 3  5.34s user 60.85s system 99% cpu 66.19s (1:06.19) total

Even so, looking at perf top shows me that a significant chunk is
contention stemming from calls to poll -- perhaps the overhead will
sufficiently go down if you epoll instead?

We have two threads here, one publishing and one subscribing, running
on CPUs 2 and 3 respectively. If we narrow down further and collect
performance data on CPU 2, we find that a large amount of CPU time is
spent on the spin lock in the wake-up logic of the eventfd write, for
example:

This is hardly surprising - you've got probably the worst kind of
producer/consumer setup here, with the producer on one CPU, and the
consumer on another. You force this relationship by pinning both of
them. Then you have a queue in between, and locking that needs to be
acquired on both sides.

Thank you for pointing it out.
We pinned the CPUs here only to highlight the issue.
In fact, the result is the same with the cpumask set to -1:

 ./a.out  -p -1 -s -1

     9.27%  [kernel]       [k] _raw_spin_lock_irq
     6.23%  [kernel]       [k] vfs_write

Another test program, which uses libzmq, does not bind the CPUs either:
https://github.com/taskset/tests/blob/master/src/test.c

We can indeed solve this problem in user space by using methods such as
shared memory, periodic data reading, atomic variables, etc. instead of
eventfd.

But since eventfd already provides the *NON-SEMAPHORE* mode, could you also
guide us on how to make better use of it and make it more comprehensive?

This matters especially because Linux is increasingly being used in
automotive scenarios.

--
Best wishes,
Wen