On Mon, Jan 20, 2025 at 02:57:21PM +0800, kernel test robot wrote:
> we reported
> "[brauner-vfs:vfs-6.14.misc] [pipe_read] aaec5a95d5: hackbench.throughput 7.5% regression"
> in
> https://lore.kernel.org/all/202501101015.90874b3a-lkp@xxxxxxxxx/
> but seems both you and Christian Brauner think it could be ignored.
>
> now we captured a regression in another test case. since e.g. there are
> something like below in perf data, not sure if it could supply any useful
> information? just FYI. sorry if it's still not with big value.
>
>       9.45            -6.3        3.13 ± 9%  perf-profile.calltrace.cycles-pp.pipe_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
> ...
>      10.00            -6.5        3.53 ± 9%  perf-profile.children.cycles-pp.pipe_read
>       2.34            -1.3        1.07 ± 9%  perf-profile.children.cycles-pp.pipe_poll

Whatever the long-term fate of the patch, I think it would be prudent to
skip it in this merge window.

Two notes first:

1. The change only considers performing a wake-up if the current source
buf got depleted -- if there is a blocked writer and there is at least one
byte in the current buf, nothing happens. This is where the difference in
results comes from.

2. stress-ng is not really a microbenchmark suite. It is more of a "do
stuff, some of which may accidentally line up with isolated behavior from
real workloads, and report ops/s" affair. Plenty of it does not line up
with anything afaics (spoiler for the tee improvement below).

So I had a look on a 24-way system and the results are as follows.

1. tee (500% win)

It performs tee/splice of a huge size along with 16-byte reads from
another worker. On a kernel without the change this results in significant
lock contention as the writer keeps being woken up, whereas with the
change the bench gets to issue multiple reads without being bothered (as
long as there is any data in the buf). For a real program this is more of
a "don't do that" kind of deal imo, so I don't think this particular win
translates to a real-world benefit.
(Modulo shite progs.)

2. hackbench (7.5% loss)

The lkp folk roll with a 128-way system and invoke:

  /usr/bin/hackbench -g 64 -f 20 --process --pipe -l 60000 -s 100

Oleg did not specify his spec; I'm guessing 4 cores * 2 threads given:

  hackbench -g 4 -f 10 --process --pipe -l 50000 -s 100

I presume the -g parameter got scaled down appropriately, so I ran this on
24 cores like so to match:

  hackbench -g 12 -f 20 --process --pipe -l 60000 -s 100

This spawns a massive number of workers (480 in my case!) and there is
tons of lock contention (go figure). I got a *massive* real-time
difference:

  23.63s user 270.20s system 2312% cpu 12.71s (12.706) # without the patch
  30.40s user 406.97s system 2293% cpu 19.07s (19.069) # with the patch

According to perf there is a significant increase in time spent performing
wake-ups by the writer, while the reader spends more time going off/on CPU
in the read routine. I think this makes sense -- on a kernel without the
patch you are more likely to get extra data before you drain the buffer.

As for the specific difference in performance: per the above, all of this
is massively contended, and the wake-ups are going to alter which locks
are contended to what extent at different scales.

I fully concede I'm not at all confident hackbench is doing anything
realistic, but I'll also note that waking up a writer to get more data
seems to make sense.

To sum up, the win was registered for something which real programs should
not be doing. Given the loss in other benches, I think it would be best to
just drop the patch altogether.