cpus_allowed impact under newer linux kernels

```
fio: fio-3.28, fio-3.35, master
linux kernel: 6.8.9 (fedora38)
ioengine: io_uring
```
Please consider this issue as a question and bear with me as I am trying to understand the changes in IO workload behaviour between the kernels - your feedback on the below would be greatly appreciated.

It seems that the impact of `cpus_allowed` changed, following cpu affinity changes in the kernel regarding the `iou-{sqp,wrk}-*` threads. To my understanding, under older kernels, e.g. 6.1, the `iou-sqp` threads were pinned to ALL online cpus, while the `iou-wrk` ones were pinned to the online cpus of a given numa node (matching the placement of the job's thread). This changed, e.g. under 6.8, such that all of these threads are now pinned to the cpus listed under `cpus_allowed`. This seems to impact performance, since with the following minimal `io_uring` setup:

```
[global]
direct=1
thread=1
norandommap=1
group_reporting=1
time_based=1

rw=randread
rwmixread=70
bs=4096
ramp_time=0
numjobs=1
log_avg_msec=0
fixedbufs=1
hipri=1
registerfiles=1
sqthread_poll=1

[filename0]
iodepth=1
cpus_allowed=15  # node1
filename=/dev/nvme0n1 # numa_node=1
```
throughput drops quite significantly compared to the 6.1 kernel:

6.8:
```
 $ fio ./fio.conf --ioengine=io_uring --runtime=5
filename0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1
fio-3.28
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=504KiB/s][r=126 IOPS][eta 00m:00s]
filename0: (groupid=0, jobs=1): err= 0: pid=6697: Thu May 23 11:34:54 2024
  read: IOPS=125, BW=501KiB/s (513kB/s)(2508KiB/5005msec)
    clat (usec): min=5006, max=13999, avg=7980.23, stdev=955.42
     lat (usec): min=5007, max=13999, avg=7980.41, stdev=955.41
    clat percentiles (usec):
     |  1.00th=[ 6980],  5.00th=[ 6980], 10.00th=[ 6980], 20.00th=[ 7046],
     | 30.00th=[ 7963], 40.00th=[ 7963], 50.00th=[ 7963], 60.00th=[ 8029],
     | 70.00th=[ 8029], 80.00th=[ 8029], 90.00th=[ 8029], 95.00th=[10945],
     | 99.00th=[11076], 99.50th=[11076], 99.90th=[13960], 99.95th=[13960],
     | 99.99th=[13960]
   bw (  KiB/s): min=  488, max=  504, per=99.78%, avg=500.44, stdev= 7.06, samples=9
   iops        : min=  122, max=  126, avg=125.11, stdev= 1.76, samples=9
  lat (msec)   : 10=93.46%, 20=6.54%
  cpu          : usr=49.74%, sys=0.00%, ctx=631, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=627,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=501KiB/s (513kB/s), 501KiB/s-501KiB/s (513kB/s-513kB/s), io=2508KiB (2568kB), run=5005-5005msec

Disk stats (read/write):
  nvme0n1: ios=614/0, merge=0/0, ticks=167/0, in_queue=167, util=12.29%
```
cpu affinity:
```
# grep Name /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/6856/task/6856/status:Name:       fio
/proc/6856/task/6857/status:Name:       fio
/proc/6856/task/6858/status:Name:       fio
/proc/6856/task/6859/status:Name:       iou-sqp-6858
# grep Cpu /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/6856/task/6856/status:Cpus_allowed:       3fffffff
/proc/6856/task/6856/status:Cpus_allowed_list:  0-29
/proc/6856/task/6857/status:Cpus_allowed:       3fffffff
/proc/6856/task/6857/status:Cpus_allowed_list:  0-29
/proc/6856/task/6858/status:Cpus_allowed:       00008000
/proc/6856/task/6858/status:Cpus_allowed_list:  15
/proc/6856/task/6859/status:Cpus_allowed:       00008000
/proc/6856/task/6859/status:Cpus_allowed_list:  15 # <- iou-sqp

```
6.1:
```
$ fio ./fio.conf --ioengine=io_uring --runtime=5
filename0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1
fio-3.28
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=102MiB/s][r=26.0k IOPS][eta 00m:00s]
filename0: (groupid=0, jobs=1): err= 0: pid=3356: Thu May 23 11:37:17 2024
  Description  : ["Test io_plugin=kernel-io-uring Blocksize=4096 Workload=randread MIX=70 qd=1024"]
  read: IOPS=25.9k, BW=101MiB/s (106MB/s)(506MiB/5001msec)
    clat (usec): min=32, max=26382, avg=38.22, stdev=74.23
     lat (usec): min=33, max=26385, avg=38.32, stdev=74.24
    clat percentiles (usec):
     |  1.00th=[   35],  5.00th=[   36], 10.00th=[   36], 20.00th=[   37],
     | 30.00th=[   37], 40.00th=[   37], 50.00th=[   37], 60.00th=[   38],
     | 70.00th=[   38], 80.00th=[   39], 90.00th=[   41], 95.00th=[   47],
     | 99.00th=[   65], 99.50th=[   68], 99.90th=[   87], 99.95th=[   94],
     | 99.99th=[  155]
   bw (  KiB/s): min=92384, max=106776, per=99.84%, avg=103344.00, stdev=4395.69, samples=9
   iops        : min=23096, max=26694, avg=25836.00, stdev=1098.92, samples=9
  lat (usec)   : 50=96.37%, 100=3.60%, 250=0.03%, 500=0.01%, 750=0.01%
  lat (msec)   : 4=0.01%, 50=0.01%
  cpu          : usr=99.50%, sys=0.00%, ctx=7, majf=0, minf=20
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=129413,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=101MiB/s (106MB/s), 101MiB/s-101MiB/s (106MB/s-106MB/s), io=506MiB (530MB), run=5001-5001msec

Disk stats (read/write):
  nvme0n1: ios=126698/0, merge=0/0, ticks=4649/0, in_queue=4649, util=97.58%
```
cpu affinity:
```
# grep Name /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/3361/task/3361/status:Name:       fio
/proc/3361/task/3371/status:Name:       fio
/proc/3361/task/3372/status:Name:       fio
/proc/3361/task/3376/status:Name:       iou-sqp-3372
# grep Cpu /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/3361/task/3361/status:Cpus_allowed:       3fffffff
/proc/3361/task/3361/status:Cpus_allowed_list:  0-29
/proc/3361/task/3371/status:Cpus_allowed:       3fffffff
/proc/3361/task/3371/status:Cpus_allowed_list:  0-29
/proc/3361/task/3372/status:Cpus_allowed:       00008000
/proc/3361/task/3372/status:Cpus_allowed_list:  15
/proc/3361/task/3376/status:Cpus_allowed:       3fffffff
/proc/3361/task/3376/status:Cpus_allowed_list:  0-29 # <- iou-sqp

```
In order to align the results between the kernels, setting a separate cpu for the sqpoll thread (keeping it on the same numa node) via `sqthread_poll_cpu` under 6.8 seems to help, as sketched below.
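For reference, the delta against the job file above is essentially just the extra `sqthread_poll_cpu` line; cpu 16 here is only an example, picked to match the affinity dump further below:
```
[filename0]
iodepth=1
cpus_allowed=15        # node1, job thread
sqthread_poll_cpu=16   # separate cpu on the same node for the iou-sqp thread
filename=/dev/nvme0n1  # numa_node=1
```
With that change the results look like this: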
```
filename0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1
fio-3.28
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=120MiB/s][r=30.8k IOPS][eta 00m:00s]
filename0: (groupid=0, jobs=1): err= 0: pid=6912: Thu May 23 11:47:13 2024
  read: IOPS=30.5k, BW=119MiB/s (125MB/s)(596MiB/5001msec)
    clat (usec): min=4, max=3588, avg=32.37, stdev= 9.64
     lat (usec): min=29, max=3590, avg=32.46, stdev= 9.65
    clat percentiles (usec):
     |  1.00th=[   31],  5.00th=[   31], 10.00th=[   32], 20.00th=[   32],
     | 30.00th=[   32], 40.00th=[   32], 50.00th=[   32], 60.00th=[   33],
     | 70.00th=[   33], 80.00th=[   34], 90.00th=[   34], 95.00th=[   35],
     | 99.00th=[   44], 99.50th=[   50], 99.90th=[   78], 99.95th=[   82],
     | 99.99th=[  106]
   bw (  KiB/s): min=117800, max=123400, per=100.00%, avg=122219.56, stdev=1871.89, samples=9
   iops        : min=29450, max=30850, avg=30554.89, stdev=467.97, samples=9
  lat (usec)   : 10=0.01%, 50=99.50%, 100=0.48%, 250=0.01%, 500=0.01%
  lat (msec)   : 4=0.01%
  cpu          : usr=99.70%, sys=0.00%, ctx=15, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=152702,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=119MiB/s (125MB/s), 119MiB/s-119MiB/s (125MB/s-125MB/s), io=596MiB (625MB), run=5001-5001msec

Disk stats (read/write):
  nvme0n1: ios=149569/0, merge=0/0, ticks=4662/0, in_queue=4662, util=98.06%
```
cpu affinity:
```
# grep Name /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/6914/task/6914/status:Name:       fio
/proc/6914/task/6915/status:Name:       fio
/proc/6914/task/6916/status:Name:       fio
/proc/6914/task/6917/status:Name:       iou-sqp-6916
# grep Cpu /proc/$(pgrep fio | tail -n1)/task/*/status
/proc/6914/task/6914/status:Cpus_allowed:       3fffffff
/proc/6914/task/6914/status:Cpus_allowed_list:  0-29
/proc/6914/task/6915/status:Cpus_allowed:       3fffffff
/proc/6914/task/6915/status:Cpus_allowed_list:  0-29
/proc/6914/task/6916/status:Cpus_allowed:       00008000
/proc/6914/task/6916/status:Cpus_allowed_list:  15
/proc/6914/task/6917/status:Cpus_allowed:       00010000
/proc/6914/task/6917/status:Cpus_allowed_list:  16 # <- iou-sqp
```

Essentially, when the `iou-sqp` thread is kept on the same cpu as the job's thread, the cpu time appears to be split in half between them. So it would seem that setting a separate cpu for these threads is the way to go.
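One way to observe the split is per-thread cpu usage, e.g. with pidstat from sysstat (top -H works as well); the pid lookup just mirrors the grep commands above:
```
# per-thread (-t) cpu usage of the fio process, sampled every second;
# the iou-sqp-* thread shows up as a separate TID next to the job thread
pidstat -t -p "$(pgrep fio | tail -n1)" 1
```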

However, for bare-metal setups, simply selecting a separate cpu for the sqpoll thread doesn't fully align the results between 6.1 and 6.8. Here's an example of a setup with 22 nvmes (the full job file follows the sketch below).
Each job is pinned to a cpu|nvme pair on the same numa node, SMT is enabled, and `sqthread_poll_cpu` is set to the sibling thread of the `cpus_allowed` cpu (so essentially each job should be bound
to a single physical core). Each nvme block device has the following config: nomerges=2, iostats=0, rq_affinity=0, io_poll_delay=-1.
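For completeness, a sketch of how those per-device settings are applied and how the SMT sibling for `sqthread_poll_cpu` can be looked up; nvme0n1 and cpu0 are just the first pair from the job file below:
```
# per-device queue settings used for every nvme in this test
echo 2  > /sys/block/nvme0n1/queue/nomerges
echo 0  > /sys/block/nvme0n1/queue/iostats
echo 0  > /sys/block/nvme0n1/queue/rq_affinity
echo -1 > /sys/block/nvme0n1/queue/io_poll_delay

# SMT sibling of cpu 0, used as that job's sqthread_poll_cpu (56 in the job file below)
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
```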

```
[global]
direct=1
thread=1
norandommap=1
group_reporting=1
time_based=1

rw=randread
rwmixread=70
bs=4096
ramp_time=0
numjobs=1
log_avg_msec=0
fixedbufs=1
hipri=1
registerfiles=1
sqthread_poll=1

[filename0]
iodepth=2048
cpus_allowed=0 
sqthread_poll_cpu=56
filename=/dev/nvme0n1 
filename=/dev/nvme1n1

[filename1]
iodepth=2048
cpus_allowed=1 
sqthread_poll_cpu=57
filename=/dev/nvme2n1 
filename=/dev/nvme6n1 

[filename2]
iodepth=2048
cpus_allowed=2 
sqthread_poll_cpu=58
filename=/dev/nvme8n1 
filename=/dev/nvme9n1 
[filename3]
iodepth=2048
cpus_allowed=3
sqthread_poll_cpu=59
filename=/dev/nvme10n1 
filename=/dev/nvme12n1 

[filename4]
iodepth=2048
cpus_allowed=4 
sqthread_poll_cpu=60
filename=/dev/nvme11n1 
filename=/dev/nvme17n1 

[filename5]
iodepth=2048
cpus_allowed=5 
sqthread_poll_cpu=61
filename=/dev/nvme3n1 
filename=/dev/nvme4n1 

[filename6]
iodepth=2048
cpus_allowed=28 
sqthread_poll_cpu=84
filename=/dev/nvme5n1 
filename=/dev/nvme7n1 
[filename7]
iodepth=2048
cpus_allowed=29 
sqthread_poll_cpu=85
filename=/dev/nvme13n1 
filename=/dev/nvme14n1 

[filename8]
iodepth=2048
cpus_allowed=30
sqthread_poll_cpu=86
filename=/dev/nvme15n1 
filename=/dev/nvme16n1 

[filename9]
iodepth=2048
cpus_allowed=31
sqthread_poll_cpu=87
filename=/dev/nvme18n1 
filename=/dev/nvme19n1

[filename10]
iodepth=1024
cpus_allowed=32 
sqthread_poll_cpu=88
filename=/dev/nvme20n1 

[filename11]
iodepth=1024
cpus_allowed=33 
sqthread_poll_cpu=89
filename=/dev/nvme21n1 
```

The results look like this:
```
filename0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename2: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename3: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename4: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename5: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename6: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename7: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename8: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename9: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2048
filename10: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1024
filename11: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1024
fio-3.37-31-gd3bc
Starting 12 threads
Jobs: 12 (f=22): [r(12)][100.0%][r=49.6GiB/s][r=13.0M IOPS][eta 00m:00s]
filename0: (groupid=0, jobs=12): err= 0: pid=46722: Thu May 23 11:55:55 2024
  Description  : ["Test io_plugin=kernel-io-uring Blocksize=4096 Workload=randread MIX=70 qd=1024"]
  read: IOPS=13.0M, BW=49.6GiB/s (53.2GB/s)(248GiB/5002msec)
    clat (usec): min=289, max=3547, avg=1733.33, stdev=463.63
     lat (usec): min=290, max=3547, avg=1733.41, stdev=463.63
    clat percentiles (usec):
     |  1.00th=[  693],  5.00th=[  783], 10.00th=[  832], 20.00th=[ 1713],
     | 30.00th=[ 1811], 40.00th=[ 1844], 50.00th=[ 1860], 60.00th=[ 1893],
     | 70.00th=[ 1926], 80.00th=[ 2024], 90.00th=[ 2180], 95.00th=[ 2278],
     | 99.00th=[ 2442], 99.50th=[ 2507], 99.90th=[ 2606], 99.95th=[ 2671],
     | 99.99th=[ 2900]
   bw (  MiB/s): min=50324, max=50952, per=100.00%, avg=50769.42, stdev=15.79, samples=108
   iops        : min=12883193, max=13043864, avg=12996971.44, stdev=4041.62, samples=108
  lat (usec)   : 500=0.03%, 750=3.06%, 1000=15.14%
  lat (msec)   : 2=60.21%, 4=21.56%
  cpu          : usr=99.61%, sys=0.01%, ctx=127, majf=0, minf=5453
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=64977031,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: bw=49.6GiB/s (53.2GB/s), 49.6GiB/s-49.6GiB/s (53.2GB/s-53.2GB/s), io=248GiB (266GB), run=5002-5002msec

Disk stats (read/write):
  nvme0n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme15n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme16n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
```
We cap at ~13M IOPS (without a separate cpu for the poll threads we cap at < 1M IOPS). However, the same setup (without `sqthread_poll_cpu`) under the 6.1 kernel yields
~20M IOPS.

The interesting part is that, under 6.8, if we move the sqpoll thread to a completely different physical core (see the note above about SMT), then the results go back to ~20M IOPS.

So I guess my question is whether all of the above is expected, and whether users should indeed start dedicating separate cpu cores to the sqpoll threads under newer kernels?

Another observation is that the number of these sqpoll threads is bound to `numjobs`; however, there's no way to define separate cpus for separate sqpoll threads. So, for instance,
`numjobs=4` would create 4 sqpoll threads all bound to the single `sqthread_poll_cpu` of the given job (see the sketch below). I am guessing this too may impact performance in some way, especially
considering that under older kernels these threads were pinned to every online cpu.
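A sketch of what I mean (hypothetical job; cpu numbers are only for illustration) - all four iou-sqp threads created by `numjobs=4` end up pinned to the single poll cpu:
```
[four-jobs]
numjobs=4
cpus_allowed=0-3
sqthread_poll=1
sqthread_poll_cpu=56   # shared by all 4 sqpoll threads of this job
filename=/dev/nvme0n1
```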

Regards,
Michal





