On 8/4/22 10:28 AM, Keith Busch wrote:
> On Wed, Aug 03, 2022 at 02:52:11PM -0600, Jens Axboe wrote:
>> I ran this on my test box to see how we'd do. First the bad news:
>> smaller block size IO seems slower. I ran with QD=8 and used 24 drives,
>> and using t/io_uring (with registered buffers, polled, etc) and a 512b
>> block size I get:
>>
>> IOPS=44.36M, BW=21.66GiB/s, IOS/call=1/1
>> IOPS=44.64M, BW=21.80GiB/s, IOS/call=2/2
>> IOPS=44.69M, BW=21.82GiB/s, IOS/call=1/1
>> IOPS=44.55M, BW=21.75GiB/s, IOS/call=2/2
>> IOPS=44.93M, BW=21.94GiB/s, IOS/call=1/1
>> IOPS=44.79M, BW=21.87GiB/s, IOS/call=1/2
>>
>> and adding -D1 I get:
>>
>> IOPS=43.74M, BW=21.36GiB/s, IOS/call=1/1
>> IOPS=44.04M, BW=21.50GiB/s, IOS/call=1/1
>> IOPS=43.63M, BW=21.30GiB/s, IOS/call=2/2
>> IOPS=43.67M, BW=21.32GiB/s, IOS/call=1/1
>> IOPS=43.57M, BW=21.28GiB/s, IOS/call=1/2
>> IOPS=43.53M, BW=21.25GiB/s, IOS/call=2/1
>>
>> which does regress that workload.
>
> Bummer, I would expect -D1 to be no worse. My test isn't nearly as
> consistent as yours, so I'm having some trouble measuring. I'm only
> coming up with a few micro-optimizations that might help. A diff is
> below on top of this series. I also created a branch with everything
> folded in here:

That seemed to do the trick! Don't pay any attention to the numbers
being slightly different from before for -D0; it's a slightly different
kernel. But it's the same test, -d8 -s2 -c2, polled:

-D0 -B1
IOPS=45.39M, BW=22.16GiB/s, IOS/call=1/1
IOPS=46.06M, BW=22.49GiB/s, IOS/call=2/1
IOPS=45.70M, BW=22.31GiB/s, IOS/call=1/1
IOPS=45.71M, BW=22.32GiB/s, IOS/call=2/2
IOPS=45.83M, BW=22.38GiB/s, IOS/call=1/1
IOPS=45.64M, BW=22.29GiB/s, IOS/call=2/2

-D1 -B1
IOPS=45.94M, BW=22.43GiB/s, IOS/call=1/1
IOPS=46.08M, BW=22.50GiB/s, IOS/call=1/1
IOPS=46.27M, BW=22.59GiB/s, IOS/call=2/1
IOPS=45.88M, BW=22.40GiB/s, IOS/call=1/1
IOPS=46.18M, BW=22.55GiB/s, IOS/call=2/1
IOPS=46.13M, BW=22.52GiB/s, IOS/call=2/2
IOPS=46.40M, BW=22.66GiB/s, IOS/call=1/1

which is a smidge higher, and definitely not regressing now.

--
Jens Axboe
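For reference, a t/io_uring invocation matching the runs above would look
roughly like the sketch below. This is a reconstruction, not the exact
command from the thread: -p1 (polled I/O) and -b512 (512b block size) are
assumed from t/io_uring's usual flags, the device list is a stand-in for
the 24 test drives, and -D presumably toggles the code path added by the
series under test.

  # baseline: QD=8, submit/complete batches of 2, polled, registered buffers
  t/io_uring -d8 -s2 -c2 -p1 -B1 -b512 -D0 /dev/nvme0n1 /dev/nvme1n1 ...

  # same test with the series' new path enabled
  t/io_uring -d8 -s2 -c2 -p1 -B1 -b512 -D1 /dev/nvme0n1 /dev/nvme1n1 ...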