```
fio: fio-3.28, fio-3.35, master
linux kernel: 6.8.9 (fedora38)
ioengine: io_uring
```

Please consider this issue as a question and bear with me as I am trying to understand changes in the I/O workloads between these kernels; your feedback on the below would be greatly appreciated.

It seems that the effect of `cpus_allowed` has changed, following CPU affinity changes in the kernel for the `iou-{sqp,wrk}-*` threads. To my understanding, under older kernels (e.g. 6.1) the `iou-sqp` threads were allowed to run on ALL online CPUs, while the `iou-wrk` threads were restricted to the online CPUs of the NUMA node the job's thread was placed on. This changed (e.g. under 6.8) such that all of these threads are now pinned to the CPUs listed under `cpus_allowed`.

This seems to impact performance. For the following minimal `io_uring` setup:

```
[global]
direct=1
thread=1
norandommap=1
group_reporting=1
time_based=1
rw=randread
rwmixread=70
bs=4096
ramp_time=0
numjobs=1
log_avg_msec=0
fixedbufs=1
hipri=1
registerfiles=1
sqthread_poll=1

[filename0]
iodepth=1
cpus_allowed=15 # node1
filename=/dev/nvme0n1 # numa_node=1
```

throughput under 6.8 drops quite significantly compared to the 6.1 kernel.

6.8:

```
$ fio ./fio.conf --ioengine=io_uring --runtime=5
filename0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1
fio-3.28
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=504KiB/s][r=126 IOPS][eta 00m:00s]
filename0: (groupid=0, jobs=1): err= 0: pid=6697: Thu May 23 11:34:54 2024
  read: IOPS=125, BW=501KiB/s (513kB/s)(2508KiB/5005msec)
    clat (usec): min=5006, max=13999, avg=7980.23, stdev=955.42
     lat (usec): min=5007, max=13999, avg=7980.41, stdev=955.41
    clat percentiles (usec):
     |  1.00th=[ 6980],  5.00th=[ 6980], 10.00th=[ 6980], 20.00th=[ 7046],
     | 30.00th=[ 7963], 40.00th=[ 7963], 50.00th=[ 7963], 60.00th=[ 8029],
     | 70.00th=[ 8029], 80.00th=[ 8029], 90.00th=[ 8029], 95.00th=[10945],
     | 99.00th=[11076], 99.50th=[11076], 99.90th=[13960], 99.95th=[13960],
     | 99.99th=[13960]
   bw (  KiB/s): min= 488, max= 504, per=99.78%, avg=500.44, stdev= 7.06, samples=9
   iops        : min= 122, max= 126, avg=125.11, stdev= 1.76, samples=9
  lat (msec)   : 10=93.46%, 20=6.54%
  cpu          : usr=49.74%, sys=0.00%, ctx=631, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=627,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=501KiB/s (513kB/s), 501KiB/s-501KiB/s (513kB/s-513kB/s), io=2508KiB (2568kB), run=5005-5005msec

Disk stats (read/write):
  nvme0n1: ios=614/0, merge=0/0, ticks=167/0, in_queue=167, util=12.29%
```

cpu affinity:

```
# grep Name /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/6856/task/6856/status:Name:   fio
/proc/6856/task/6857/status:Name:   fio
/proc/6856/task/6858/status:Name:   fio
/proc/6856/task/6859/status:Name:   iou-sqp-6858

# grep Cpu /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/6856/task/6856/status:Cpus_allowed:       3fffffff
/proc/6856/task/6856/status:Cpus_allowed_list:  0-29
/proc/6856/task/6857/status:Cpus_allowed:       3fffffff
/proc/6856/task/6857/status:Cpus_allowed_list:  0-29
/proc/6856/task/6858/status:Cpus_allowed:       00008000
/proc/6856/task/6858/status:Cpus_allowed_list:  15
/proc/6856/task/6859/status:Cpus_allowed:       00008000
/proc/6856/task/6859/status:Cpus_allowed_list:  15   # <- iou-sqp
```
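(Side note: `Cpus_allowed` only shows where a thread *may* run. To double-check where the job thread and its `iou-sqp` poller actually end up, something like the sketch below can be used; the `psr` column is the CPU a thread last ran on, and given the 6.8 mask above it will show both threads on CPU 15.)

```
# Show the CPU each thread of the fio job last ran on (psr column);
# with the 6.8 affinity mask above, both the job thread and iou-sqp-* sit on CPU 15.
ps -L -o pid,tid,psr,comm -p "$(pgrep fio | tail -n1)"
```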
6.1:

```
$ fio ./fio.conf --ioengine=io_uring --runtime=5
filename0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1
fio-3.28
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=102MiB/s][r=26.0k IOPS][eta 00m:00s]
filename0: (groupid=0, jobs=1): err= 0: pid=3356: Thu May 23 11:37:17 2024
  Description  : ["Test io_plugin=kernel-io-uring Blocksize=4096 Workload=randread MIX=70 qd=1024"]
  read: IOPS=25.9k, BW=101MiB/s (106MB/s)(506MiB/5001msec)
    clat (usec): min=32, max=26382, avg=38.22, stdev=74.23
     lat (usec): min=33, max=26385, avg=38.32, stdev=74.24
    clat percentiles (usec):
     |  1.00th=[   35],  5.00th=[   36], 10.00th=[   36], 20.00th=[   37],
     | 30.00th=[   37], 40.00th=[   37], 50.00th=[   37], 60.00th=[   38],
     | 70.00th=[   38], 80.00th=[   39], 90.00th=[   41], 95.00th=[   47],
     | 99.00th=[   65], 99.50th=[   68], 99.90th=[   87], 99.95th=[   94],
     | 99.99th=[  155]
   bw (  KiB/s): min=92384, max=106776, per=99.84%, avg=103344.00, stdev=4395.69, samples=9
   iops        : min=23096, max=26694, avg=25836.00, stdev=1098.92, samples=9
  lat (usec)   : 50=96.37%, 100=3.60%, 250=0.03%, 500=0.01%, 750=0.01%
  lat (msec)   : 4=0.01%, 50=0.01%
  cpu          : usr=99.50%, sys=0.00%, ctx=7, majf=0, minf=20
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=129413,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=101MiB/s (106MB/s), 101MiB/s-101MiB/s (106MB/s-106MB/s), io=506MiB (530MB), run=5001-5001msec

Disk stats (read/write):
  nvme0n1: ios=126698/0, merge=0/0, ticks=4649/0, in_queue=4649, util=97.58%
```

cpu affinity:

```
# grep Name /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/3361/task/3361/status:Name:   fio
/proc/3361/task/3371/status:Name:   fio
/proc/3361/task/3372/status:Name:   fio
/proc/3361/task/3376/status:Name:   iou-sqp-3372

# grep Cpu /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/3361/task/3361/status:Cpus_allowed:       3fffffff
/proc/3361/task/3361/status:Cpus_allowed_list:  0-29
/proc/3361/task/3371/status:Cpus_allowed:       3fffffff
/proc/3361/task/3371/status:Cpus_allowed_list:  0-29
/proc/3361/task/3372/status:Cpus_allowed:       00008000
/proc/3361/task/3372/status:Cpus_allowed_list:  15
/proc/3361/task/3376/status:Cpus_allowed:       3fffffff
/proc/3361/task/3376/status:Cpus_allowed_list:  0-29   # <- iou-sqp
```

In order to align the results between the kernels, setting a separate CPU (still on the same NUMA node) for the sqpoll thread under 6.8 seems to help.
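The only change to the minimal job file above is pinning the poller explicitly via `sqthread_poll_cpu`; a sketch of that change, assuming CPU 16 is a spare CPU on the same node (it matches the affinity dump further below):

```
# Appended to the end of fio.conf, i.e. to the [filename0] section:
# pin the sqpoll thread to CPU 16 (a spare node1 CPU) instead of sharing CPU 15.
echo 'sqthread_poll_cpu=16' >> ./fio.conf
```

With the poller on its own CPU, the 6.8 run looks like this: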
```
filename0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1
fio-3.28
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=120MiB/s][r=30.8k IOPS][eta 00m:00s]
filename0: (groupid=0, jobs=1): err= 0: pid=6912: Thu May 23 11:47:13 2024
  read: IOPS=30.5k, BW=119MiB/s (125MB/s)(596MiB/5001msec)
    clat (usec): min=4, max=3588, avg=32.37, stdev= 9.64
     lat (usec): min=29, max=3590, avg=32.46, stdev= 9.65
    clat percentiles (usec):
     |  1.00th=[   31],  5.00th=[   31], 10.00th=[   32], 20.00th=[   32],
     | 30.00th=[   32], 40.00th=[   32], 50.00th=[   32], 60.00th=[   33],
     | 70.00th=[   33], 80.00th=[   34], 90.00th=[   34], 95.00th=[   35],
     | 99.00th=[   44], 99.50th=[   50], 99.90th=[   78], 99.95th=[   82],
     | 99.99th=[  106]
   bw (  KiB/s): min=117800, max=123400, per=100.00%, avg=122219.56, stdev=1871.89, samples=9
   iops        : min=29450, max=30850, avg=30554.89, stdev=467.97, samples=9
  lat (usec)   : 10=0.01%, 50=99.50%, 100=0.48%, 250=0.01%, 500=0.01%
  lat (msec)   : 4=0.01%
  cpu          : usr=99.70%, sys=0.00%, ctx=15, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=152702,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=119MiB/s (125MB/s), 119MiB/s-119MiB/s (125MB/s-125MB/s), io=596MiB (625MB), run=5001-5001msec

Disk stats (read/write):
  nvme0n1: ios=149569/0, merge=0/0, ticks=4662/0, in_queue=4662, util=98.06%
```

cpu affinity:

```
# grep Name /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/6914/task/6914/status:Name:   fio
/proc/6914/task/6915/status:Name:   fio
/proc/6914/task/6916/status:Name:   fio
/proc/6914/task/6917/status:Name:   iou-sqp-6916

# grep Cpu /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/6914/task/6914/status:Cpus_allowed:       3fffffff
/proc/6914/task/6914/status:Cpus_allowed_list:  0-29
/proc/6914/task/6915/status:Cpus_allowed:       3fffffff
/proc/6914/task/6915/status:Cpus_allowed_list:  0-29
/proc/6914/task/6916/status:Cpus_allowed:       00008000
/proc/6914/task/6916/status:Cpus_allowed_list:  15
/proc/6914/task/6917/status:Cpus_allowed:       00010000
/proc/6914/task/6917/status:Cpus_allowed_list:  16   # <- iou-sqp
```

Essentially, when the `iou-sqp` thread is kept on the same CPU as the job's thread, the available CPU time appears to be split in half between them, so it would seem that setting a separate CPU for these threads is the way to go.

However, for bare-metal setups, simply selecting a separate CPU for the sqpoll thread doesn't fully align the results between 6.1 and 6.8. Here's an example of a setup with 22 NVMes. Each job is pinned to a CPU/NVMe pair on the same NUMA node, SMT is enabled, and `sqthread_poll_cpu` is set to the SMT sibling of the `cpus_allowed` CPU, so essentially each job should be bound to a single physical core. Each NVMe block device is configured with nomerges=2, iostats=0, rq_affinity=0, io_poll_delay=-1.
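For reference, this is roughly how the sibling pairs were looked up and how the per-device queue settings were applied. The CPU numbering and the `nvme*n1` glob are specific to this box, so treat it as a sketch:

```
# SMT sibling of CPU 0; should print e.g. "0,56" on this box, so 56 becomes sqthread_poll_cpu
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

# Per-NVMe block queue settings used for these runs
for q in /sys/block/nvme*n1/queue; do
    echo 2  > "$q/nomerges"
    echo 0  > "$q/iostats"
    echo 0  > "$q/rq_affinity"
    echo -1 > "$q/io_poll_delay"
done
```

The full job file for this setup: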
```
[global]
direct=1
thread=1
norandommap=1
group_reporting=1
time_based=1
rw=randread
rwmixread=70
bs=4096
ramp_time=0
numjobs=1
log_avg_msec=0
fixedbufs=1
hipri=1
registerfiles=1
sqthread_poll=1

[filename0]
iodepth=2048
cpus_allowed=0
sqthread_poll_cpu=56
filename=/dev/nvme0n1
filename=/dev/nvme1n1

[filename1]
iodepth=2048
cpus_allowed=1
sqthread_poll_cpu=57
filename=/dev/nvme2n1
filename=/dev/nvme6n1

[filename2]
iodepth=2048
cpus_allowed=2
sqthread_poll_cpu=58
filename=/dev/nvme8n1
filename=/dev/nvme9n1

[filename3]
iodepth=2048
cpus_allowed=3
sqthread_poll_cpu=59
filename=/dev/nvme10n1
filename=/dev/nvme12n1

[filename4]
iodepth=2048
cpus_allowed=4
sqthread_poll_cpu=60
filename=/dev/nvme11n1
filename=/dev/nvme17n1

[filename5]
iodepth=2048
cpus_allowed=5
sqthread_poll_cpu=61
filename=/dev/nvme3n1
filename=/dev/nvme4n1

[filename6]
iodepth=2048
cpus_allowed=28
sqthread_poll_cpu=84
filename=/dev/nvme5n1
filename=/dev/nvme7n1

[filename7]
iodepth=2048
cpus_allowed=29
sqthread_poll_cpu=85
filename=/dev/nvme13n1
filename=/dev/nvme14n1

[filename8]
iodepth=2048
cpus_allowed=30
sqthread_poll_cpu=86
filename=/dev/nvme15n1
filename=/dev/nvme16n1

[filename9]
iodepth=2048
cpus_allowed=31
sqthread_poll_cpu=87
filename=/dev/nvme18n1
filename=/dev/nvme19n1

[filename10]
iodepth=1024
cpus_allowed=32
sqthread_poll_cpu=88
filename=/dev/nvme20n1

[filename11]
iodepth=1024
cpus_allowed=33
sqthread_poll_cpu=89
filename=/dev/nvme21n1
```

The results look like this:

```
filename0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename2: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename3: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename4: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename5: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename6: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename7: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename8: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename9: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename10: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1024
filename11: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1024
fio-3.37-31-gd3bc
Starting 12 threads
Jobs: 12 (f=22): [r(12)][100.0%][r=49.6GiB/s][r=13.0M IOPS][eta 00m:00s]
filename0: (groupid=0, jobs=12): err= 0: pid=46722: Thu May 23 11:55:55 2024
  Description  : ["Test io_plugin=kernel-io-uring Blocksize=4096 Workload=randread MIX=70 qd=1024"]
  read: IOPS=13.0M, BW=49.6GiB/s (53.2GB/s)(248GiB/5002msec)
    clat (usec): min=289, max=3547, avg=1733.33, stdev=463.63
     lat (usec): min=290, max=3547, avg=1733.41, stdev=463.63
    clat percentiles (usec):
     |  1.00th=[  693],  5.00th=[  783], 10.00th=[  832], 20.00th=[ 1713],
     | 30.00th=[ 1811], 40.00th=[ 1844], 50.00th=[ 1860], 60.00th=[ 1893],
     | 70.00th=[ 1926], 80.00th=[ 2024], 90.00th=[ 2180], 95.00th=[ 2278],
     | 99.00th=[ 2442], 99.50th=[ 2507], 99.90th=[ 2606], 99.95th=[ 2671],
     | 99.99th=[ 2900]
   bw (  MiB/s): min=50324, max=50952, per=100.00%, avg=50769.42, stdev=15.79, samples=108
   iops        : min=12883193, max=13043864, avg=12996971.44, stdev=4041.62, samples=108
  lat (usec)   : 500=0.03%, 750=3.06%, 1000=15.14%
  lat (msec)   : 2=60.21%, 4=21.56%
  cpu          : usr=99.61%, sys=0.01%, ctx=127, majf=0, minf=5453
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=64977031,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: bw=49.6GiB/s (53.2GB/s), 49.6GiB/s-49.6GiB/s (53.2GB/s-53.2GB/s), io=248GiB (266GB), run=5002-5002msec

Disk stats (read/write):
  nvme0n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme15n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme16n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
```

We cap at ~13M IOPS (without a separate CPU for the poll threads we cap below 1M IOPS). However, the same setup (without `sqthread_poll_cpu`) under the 6.1 kernel yields ~20M IOPS. The interesting part is that, under 6.8, if we move the sqpoll thread to a completely different physical core (see the note above about SMT siblings), the results go back to ~20M IOPS.

So I guess my question is whether all of the above is expected, and whether users should indeed start dedicating separate CPU cores to the sqpoll threads under newer kernels?

Another observation is that the number of these sqpoll threads is bound to `numjobs`, but there is no way to define separate CPUs for separate sqpoll threads. So, for instance, `numjobs=4` would create 4 sqpoll threads all bound to the single `sqthread_poll_cpu` of the given job. I am guessing this too may impact performance in some way, especially considering that under older kernels these threads were allowed to run on every online CPU.
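(A possible workaround for that last point, in case it is useful: skip `numjobs` and generate one explicit job section per CPU pair, so that every sqpoll thread gets its own `sqthread_poll_cpu`. A rough sketch with placeholder CPU numbers:)

```
# Instead of numjobs=4, emit four explicit job sections, each with its own
# cpus_allowed/sqthread_poll_cpu pair (the CPU numbers below are placeholders).
for i in 0 1 2 3; do
cat <<EOF >> ./fio.conf
[job$i]
iodepth=1
cpus_allowed=$((12 + i))
sqthread_poll_cpu=$((16 + i))
filename=/dev/nvme0n1
EOF
done
```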
Regards, Michal