On Mon, Jan 20, 2025 at 02:57:21PM +0800, kernel test robot wrote:
> we reported
> "[brauner-vfs:vfs-6.14.misc] [pipe_read] aaec5a95d5: hackbench.throughput 7.5% regression"
> in
> https://lore.kernel.org/all/202501101015.90874b3a-lkp@xxxxxxxxx/
> but seems both you and Christian Brauner think it could be ignored.
>
> now we captured a regression in another test case. since e.g. there are
> something like below in perf data, not sure if it could supply any useful
> information? just FYI. sorry if it's still not with big value.
>
>       9.45            -6.3        3.13 ± 9%  perf-profile.calltrace.cycles-pp.pipe_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
> ...
>      10.00            -6.5        3.53 ± 9%  perf-profile.children.cycles-pp.pipe_read
>       2.34            -1.3        1.07 ± 9%  perf-profile.children.cycles-pp.pipe_poll

Whatever the long-term fate of the patch, I think it would be prudent to
skip it in this merge window.

Two notes first:

1. The change only considers performing a wake-up if the current source
buf got depleted -- if there is a blocked writer and there is at least one
byte in the current buf, nothing happens. This is where the difference in
results comes from.

2. stress-ng is not really a microbenchmark suite. It is more of a "do
stuff, some of which may accidentally line up with isolated behavior from
real workloads, and report ops/s" affair. Plenty of it does not line up
with anything afaics (spoiler for the tee improvement below).

So I had a look on a 24-way system and the results are as follows.

1. tee (500% win)

It performs tee/splice of a huge size along with 16-byte reads from
another worker. On a kernel without the change this results in significant
lock contention as the writer keeps being woken up, whereas with the
change the bench gets to issue multiple reads without being bothered (as
long as there is any data in the buf). For a real program this is more of
a "don't do that" kind of deal imo, so I don't think this particular win
translates to a real-world benefit.
(Modulo shite progs.)

2. hackbench (7.5% loss)

The lkp folk roll with a 128-way system and invoke:

  /usr/bin/hackbench -g 64 -f 20 --process --pipe -l 60000 -s 100

Oleg did not specify his spec; I'm guessing 4 cores * 2 threads given:

  hackbench -g 4 -f 10 --process --pipe -l 50000 -s 100

I presume the -g parameter got scaled down appropriately, so I ran this on
24 cores like so to match:

  hackbench -g 12 -f 20 --process --pipe -l 60000 -s 100

This spawns a massive number of workers (480 in my case!) and there is
tons of lock contention (go figure). I got a *massive* real-time
difference:

  23.63s user 270.20s system 2312% cpu 12.71s (12.706) # without the patch
  30.40s user 406.97s system 2293% cpu 19.07s (19.069) # with the patch

According to perf there is a significant increase in time spent performing
wake-ups by the writer, while the reader spends more time going off/on CPU
in the read routine. I think this makes sense -- on a kernel without the
patch you are more likely to get extra data before you drain the buffer.

As for the specific difference in performance: per the above, all of this
is massively contended, and the wake-ups are going to alter which locks
are contended to what extent at different scales.

I fully concede I'm not at all confident hackbench is doing anything
realistic, but I'll also note that waking up a writer to get more data
seems to make sense.

To sum up, the win was registered for something which real programs should
not be doing. Given the loss in other benches, I think it would be best to
just drop the patch altogether.